This is a dataset from a website called InsideAirbnb.com. It is a dataset that was scraped on December 30 and 31, 2022 from Airbnb for the previous 12 months’ listings for Denver, CO.
The original dataset has 3050 rows and 32 variables. The variables were all normalized because of their vast differences in scale (some variables had values in the thousands; others had a maximum value of 5). The number of normalized variables was 105.
Analyses were done to predict two variables. Price was the continuous variable, and value_score_cat was the categorical variable. There were two values for value_score_cat. A perfect score of 5.0 was coded as “excellent”, and anything else was coded as “other”.
The following methodology was undertaken for each of the variables: • Examine the distribution of the variables to look for relationships • Conduct simple regression analyses for baselines • Normalize all variables • Employ lasso regression to narrow down the number of variables • Build one random forest model with all predictors and a second one with only the lasso predictors
Number of baths ended up being the most important predictor of price, by a huge margin. Number of baths is a good proxy for property size: more baths suggests larger property. Price being higher for a larger property makes sense.
Looking at the analyses as a whole, the three most important factors that a host could change are minimum nights, review scores rating, and maximum nights. We see from the regression coefficients that having a lower number of minimum nights, a higher number of maximum nights, and a higher review scores rating all increase price. The recommendations for a host would be to be flexible (allow for both shorter and longer stays) and to work hard to please the customers.
Review scores rating is the most important predictor of value score.
Interestingly, review scores rating is negatively correlated with value score. This would mean that higher-rated properties overall are rated as having lower value for the price. The other two significant predictors from the lasso regression are the number of reviews in the last 12 months and the number of months since the first review of the property. These are both positively correlated to value score, which means that properties that have been airbnbs longer and which have had more customers recently receive better ratings for value.
The takeaway from the negative correlation between review scores rating and value score may be that, even though customers have a great overall experience at an airbnb, they may believe that it was overpriced.
More research should be done before offering airbnb hosts advice on this matter, however.
This is a dataset from a website called InsideAirbnb.com. It is a dataset that was scraped on December 30 and 31, 2022 from Airbnb for the previous 12 months’ listings for Denver, CO. The website is http://insideairbnb.com/get-the-data. Variables from the downloaded detailed dataset have been transformed in R in order to conduct the analyses. The distribution of variables will be examined first in order to look for relationships. Regression analysis will be performed in order to predict a property’s per-night price. Following normalization of the variables, lasso regression will be employed to narrow down the number of variables, as well. One random forest model will be run with all predictors and a second one with the lasso predictors. A classification analysis will be used to predict a customer’s rating of the value of the property (whether the property is a good value for the price paid). Because the ratings skew high, there were two ratings assigned: ‘excellent’ for a perfect rating and ‘other’ for anything less. Lasso regression will again be used in order to narrow down the number of predictors, followed by logistic regression. One random forest model will be run with all predictors and a second one with the lasso predictors. Finally, a summary of conclusions will be presented.
This dataset has 3050 rows and 32 variables.
This is a dataset from a website called InsideAirbnb.com. It is a dataset that was scraped on December 30 and 31, 2022 from Airbnb for the previous 12 months’ listings for Denver, CO. The website is http://insideairbnb.com/get-the-data.
VARIABLES TO PREDICT WITH
VARIABLES WE WANT TO PREDICT
host_in_denver host_response_time host_response_rate
no : 384 a few days or more: 2 Min. : 17.00
yes:2666 within a day : 147 1st Qu.:100.00
within a few hours: 287 Median :100.00
within an hour :2614 Mean : 98.92
3rd Qu.:100.00
Max. :100.00
host_acceptance_rate host_is_superhost neighborhood latitude
Min. : 20.00 no :1047 Five Points : 258 Min. :39.63
1st Qu.: 94.00 yes:2003 Highland : 238 1st Qu.:39.73
Median : 99.00 West Colfax : 139 Median :39.75
Mean : 94.27 West Highland: 117 Mean :39.74
3rd Qu.:100.00 Union Station: 108 3rd Qu.:39.76
Max. :100.00 Berkeley : 105 Max. :39.82
(Other) :2085
longitude room_type max_guests bedrooms
Min. :-105.1 Entire home/apt:2581 Min. : 1.000 Min. :1.000
1st Qu.:-105.0 Hotel room : 7 1st Qu.: 2.000 1st Qu.:1.000
Median :-105.0 Private room : 442 Median : 4.000 Median :2.000
Mean :-105.0 Shared room : 20 Mean : 4.265 Mean :1.818
3rd Qu.:-105.0 3rd Qu.: 6.000 3rd Qu.:2.000
Max. :-104.7 Max. :16.000 Max. :9.000
beds price min_nights max_nights
Min. : 1.000 Min. : 15.0 Min. : 1.00 Min. : 1.0
1st Qu.: 1.000 1st Qu.: 89.0 1st Qu.: 2.00 1st Qu.: 365.0
Median : 2.000 Median : 125.0 Median : 2.90 Median : 1125.0
Mean : 2.292 Mean : 167.8 Mean : 18.48 Mean : 866.9
3rd Qu.: 3.000 3rd Qu.: 189.0 3rd Qu.: 29.00 3rd Qu.: 1125.0
Max. :14.000 Max. :2614.0 Max. :365.00 Max. :10000.0
number_of_reviews number_of_reviews_ltm review_scores_rating
Min. : 1.00 Min. : 0.00 Min. :1.000
1st Qu.: 6.00 1st Qu.: 2.00 1st Qu.:4.820
Median : 28.00 Median : 10.00 Median :4.930
Mean : 68.88 Mean : 20.13 Mean :4.852
3rd Qu.: 86.00 3rd Qu.: 31.00 3rd Qu.:5.000
Max. :1338.00 Max. :400.00 Max. :5.000
review_scores_accuracy review_scores_cleanliness review_scores_checkin
Min. :1.00 Min. :1.00 Min. :1.000
1st Qu.:4.87 1st Qu.:4.82 1st Qu.:4.920
Median :4.95 Median :4.94 Median :4.980
Mean :4.88 Mean :4.85 Mean :4.917
3rd Qu.:5.00 3rd Qu.:5.00 3rd Qu.:5.000
Max. :5.00 Max. :5.00 Max. :5.000
review_scores_communication review_scores_location review_scores_value
Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:4.920 1st Qu.:4.840 1st Qu.:4.730
Median :4.980 Median :4.930 Median :4.850
Mean :4.905 Mean :4.866 Mean :4.778
3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:4.940
Max. :5.000 Max. :5.000 Max. :5.000
calculated_host_listings_count reviews_per_month host_tenure
Min. : 1.0 Min. : 0.020 Min. : 1.324
1st Qu.: 1.0 1st Qu.: 0.510 1st Qu.: 69.077
Median : 1.0 Median : 1.590 Median : 84.511
Mean : 11.8 Mean : 2.187 Mean : 82.707
3rd Qu.: 4.0 3rd Qu.: 3.230 3rd Qu.:102.482
Max. :243.0 Max. :40.000 Max. :173.808
since_first_review since_last_review num_bath bath_type
Min. : 0.04244 Min. : 0.04244 Min. :0.500 private:2855
1st Qu.: 7.56605 1st Qu.: 0.56810 1st Qu.:1.000 shared : 195
Median : 19.39357 Median : 1.29090 Median :1.000
Mean : 30.36135 Mean : 3.27590 Mean :1.499
3rd Qu.: 48.32991 3rd Qu.: 3.03217 3rd Qu.:2.000
Max. :166.28474 Max. :132.60917 Max. :9.500
value_score_cat
excellent: 603
other :2447
max_guests (Maximum number of guests
at one time)| max_guests | n | mean(price) |
|---|---|---|
| 1 | 106 | 51.85 |
| 2 | 918 | 104.55 |
| 3 | 231 | 117.15 |
| 4 | 823 | 142.26 |
| 5 | 162 | 169.94 |
| 6 | 398 | 228.21 |
| 7 | 56 | 293.89 |
| 8 | 154 | 292.26 |
| 9 | 21 | 278.29 |
| 10 | 89 | 339.85 |
| 11 | 13 | 413.15 |
| 12 | 44 | 502.73 |
| 13 | 7 | 532.00 |
| 14 | 12 | 686.58 |
| 15 | 5 | 363.60 |
| 16 | 11 | 825.82 |
Unsurprisingly, we see that the price values are very right-skewed. Although the median price is $125, the maximum price is $2614. Of the continuous predictors, max_guests, beds, bedrooms, and num_bath have the highest correlations with price. That would be logical, since a property that can accommodate more guests would likely command a higher price.
We see that the ‘excellent’ value score category makes up approximately 20% of the total. One interesting finding is that non-superhosts outnumber superhosts in the ‘excellent’ value score category, even though the percentage of superhosts is much higher than the percentage of hosts who are not superhosts.
After finding collinearity between beds, bedrooms, num_baths, and max_guests, individual regressions were run between price and the four collinear variables in order to select the one with the largest coefficient. Num_bath, having the highest coefficient of 127.8, was retained in the dataset. The other three variables were eliminated.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 14.411 | 4.799 | 3.003 | 0.003 |
| max_guests | 35.970 | 0.961 | 37.411 | 0.000 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -23.688 | 5.005 | -4.733 | 0 |
| num_bath | 127.807 | 2.943 | 43.430 | 0 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -5.308 | 5.036 | -1.054 | 0.292 |
| bedrooms | 95.219 | 2.419 | 39.362 | 0.000 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 38.590 | 4.586 | 8.414 | 0 |
| beds | 56.378 | 1.659 | 33.988 | 0 |
Linear and logistic regression models were run as baseline models. All predictors were used with these initial models.
We can see from the actual-predicted plot for the linear regression that there are a number of price outliers that are likely causing problems with predictions. Given the large number of outliers, the R-square value of .448 is not surprising.
In contrast, the logistic regression seems to be fairly accurate overall, with accuracy of .912 and an AUC of .96. It is achieving this accuracy with a high specificity of .946 and a rather lower sensitivity of .776.
Here is the initial regression model predicting price using all predictors.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 60755.074 | 41260.513 | 1.472 | 0.141 |
| host_in_denveryes | 13.445 | 7.812 | 1.721 | 0.085 |
| host_response_timewithin a day | 100.638 | 99.732 | 1.009 | 0.313 |
| host_response_timewithin a few hours | 99.962 | 100.740 | 0.992 | 0.321 |
| host_response_timewithin an hour | 113.061 | 100.939 | 1.120 | 0.263 |
| host_response_rate | -0.850 | 0.562 | -1.512 | 0.131 |
| host_acceptance_rate | -0.126 | 0.235 | -0.537 | 0.592 |
| host_is_superhostyes | -13.343 | 5.532 | -2.412 | 0.016 |
| neighborhoodAuraria | -129.488 | 79.650 | -1.626 | 0.104 |
| neighborhoodBaker | -26.446 | 28.664 | -0.923 | 0.356 |
| neighborhoodBarnum | 17.528 | 39.377 | 0.445 | 0.656 |
| neighborhoodBarnum West | 3.667 | 39.808 | 0.092 | 0.927 |
| neighborhoodBear Valley | 11.676 | 80.857 | 0.144 | 0.885 |
| neighborhoodBelcaro | 364.047 | 64.303 | 5.661 | 0.000 |
| neighborhoodBerkeley | -22.361 | 44.717 | -0.500 | 0.617 |
| neighborhoodCapitol Hill | -21.097 | 33.117 | -0.637 | 0.524 |
| neighborhoodCBD | 32.263 | 34.807 | 0.927 | 0.354 |
| neighborhoodChaffee Park | -15.118 | 53.273 | -0.284 | 0.777 |
| neighborhoodCheesman Park | -35.447 | 36.379 | -0.974 | 0.330 |
| neighborhoodCherry Creek | -105.990 | 44.180 | -2.399 | 0.016 |
| neighborhoodCity Park | -54.655 | 44.770 | -1.221 | 0.222 |
| neighborhoodCity Park West | -80.398 | 37.821 | -2.126 | 0.034 |
| neighborhoodCivic Center | -131.092 | 43.657 | -3.003 | 0.003 |
| neighborhoodClayton | -84.017 | 48.443 | -1.734 | 0.083 |
| neighborhoodCole | 24.432 | 44.486 | 0.549 | 0.583 |
| neighborhoodCollege View - South Platte | 0.436 | 63.453 | 0.007 | 0.995 |
| neighborhoodCongress Park | -65.791 | 38.852 | -1.693 | 0.090 |
| neighborhoodCory - Merrill | -62.961 | 51.427 | -1.224 | 0.221 |
| neighborhoodCountry Club | -39.348 | 51.111 | -0.770 | 0.441 |
| neighborhoodDIA | -265.768 | 113.648 | -2.339 | 0.019 |
| neighborhoodEast Colfax | -137.024 | 57.979 | -2.363 | 0.018 |
| neighborhoodElyria Swansea | -80.791 | 62.249 | -1.298 | 0.194 |
| neighborhoodFive Points | -34.935 | 37.422 | -0.934 | 0.351 |
| neighborhoodFort Logan | 7.566 | 74.343 | 0.102 | 0.919 |
| neighborhoodGateway - Green Valley Ranch | -265.493 | 100.014 | -2.655 | 0.008 |
| neighborhoodGlobeville | -55.155 | 60.450 | -0.912 | 0.362 |
| neighborhoodGoldsmith | -44.727 | 58.000 | -0.771 | 0.441 |
| neighborhoodHale | -95.195 | 46.082 | -2.066 | 0.039 |
| neighborhoodHampden | -93.596 | 64.408 | -1.453 | 0.146 |
| neighborhoodHampden South | -123.880 | 62.397 | -1.985 | 0.047 |
| neighborhoodHarvey Park | 3.650 | 56.115 | 0.065 | 0.948 |
| neighborhoodHarvey Park South | 19.938 | 65.228 | 0.306 | 0.760 |
| neighborhoodHighland | -4.726 | 37.880 | -0.125 | 0.901 |
| neighborhoodHilltop | -89.985 | 59.848 | -1.504 | 0.133 |
| neighborhoodIndian Creek | -251.374 | 103.117 | -2.438 | 0.015 |
| neighborhoodJefferson Park | -38.646 | 36.428 | -1.061 | 0.289 |
| neighborhoodLincoln Park | -40.319 | 32.741 | -1.231 | 0.218 |
| neighborhoodLowry Field | -178.195 | 70.397 | -2.531 | 0.011 |
| neighborhoodMar Lee | -27.764 | 51.665 | -0.537 | 0.591 |
| neighborhoodMarston | -31.277 | 134.454 | -0.233 | 0.816 |
| neighborhoodMontbello | -184.683 | 88.385 | -2.090 | 0.037 |
| neighborhoodMontclair | -84.039 | 54.463 | -1.543 | 0.123 |
| neighborhoodNorth Capitol Hill | -36.051 | 39.319 | -0.917 | 0.359 |
| neighborhoodNorth Park Hill | -104.733 | 50.581 | -2.071 | 0.038 |
| neighborhoodNortheast Park Hill | -110.563 | 54.101 | -2.044 | 0.041 |
| neighborhoodOverland | -43.115 | 38.319 | -1.125 | 0.261 |
| neighborhoodPlatt Park | -35.478 | 32.891 | -1.079 | 0.281 |
| neighborhoodRegis | -18.087 | 53.135 | -0.340 | 0.734 |
| neighborhoodRosedale | -37.973 | 39.872 | -0.952 | 0.341 |
| neighborhoodRuby Hill | -8.124 | 36.318 | -0.224 | 0.823 |
| neighborhoodSkyland | -62.148 | 45.105 | -1.378 | 0.168 |
| neighborhoodSloan Lake | 13.762 | 36.528 | 0.377 | 0.706 |
| neighborhoodSouth Park Hill | -81.466 | 49.597 | -1.643 | 0.101 |
| neighborhoodSpeer | -38.445 | 29.333 | -1.311 | 0.190 |
| neighborhoodStapleton | -181.470 | 65.552 | -2.768 | 0.006 |
| neighborhoodSun Valley | -20.041 | 129.607 | -0.155 | 0.877 |
| neighborhoodSunnyside | -43.260 | 43.492 | -0.995 | 0.320 |
| neighborhoodUnion Station | -5.792 | 36.135 | -0.160 | 0.873 |
| neighborhoodUniversity | -13.795 | 38.585 | -0.358 | 0.721 |
| neighborhoodUniversity Hills | -86.977 | 52.684 | -1.651 | 0.099 |
| neighborhoodUniversity Park | 64.361 | 43.125 | 1.492 | 0.136 |
| neighborhoodValverde | -17.472 | 42.216 | -0.414 | 0.679 |
| neighborhoodVilla Park | -18.166 | 34.143 | -0.532 | 0.595 |
| neighborhoodVirginia Village | -80.757 | 44.736 | -1.805 | 0.071 |
| neighborhoodWashington Park | -36.897 | 38.428 | -0.960 | 0.337 |
| neighborhoodWashington Park West | -19.210 | 31.855 | -0.603 | 0.547 |
| neighborhoodWashington Virginia Vale | -121.983 | 46.910 | -2.600 | 0.009 |
| neighborhoodWellshire | -79.543 | 67.590 | -1.177 | 0.239 |
| neighborhoodWest Colfax | -16.450 | 32.040 | -0.513 | 0.608 |
| neighborhoodWest Highland | 14.683 | 40.075 | 0.366 | 0.714 |
| neighborhoodWestwood | 8.254 | 37.848 | 0.218 | 0.827 |
| neighborhoodWhittier | -44.132 | 41.809 | -1.056 | 0.291 |
| neighborhoodWindsor | -131.054 | 76.982 | -1.702 | 0.089 |
| latitude | 350.855 | 494.402 | 0.710 | 0.478 |
| longitude | 712.799 | 348.336 | 2.046 | 0.041 |
| room_typeHotel room | -97.304 | 50.425 | -1.930 | 0.054 |
| room_typePrivate room | -21.999 | 8.608 | -2.556 | 0.011 |
| room_typeShared room | -94.165 | 32.582 | -2.890 | 0.004 |
| min_nights | -0.196 | 0.102 | -1.924 | 0.054 |
| max_nights | 0.008 | 0.005 | 1.511 | 0.131 |
| number_of_reviews | -0.107 | 0.042 | -2.539 | 0.011 |
| number_of_reviews_ltm | 0.154 | 0.191 | 0.809 | 0.419 |
| review_scores_rating | 53.688 | 18.974 | 2.830 | 0.005 |
| review_scores_accuracy | 14.818 | 15.703 | 0.944 | 0.345 |
| review_scores_cleanliness | 5.910 | 11.543 | 0.512 | 0.609 |
| review_scores_checkin | -12.909 | 15.368 | -0.840 | 0.401 |
| review_scores_communication | -4.704 | 14.054 | -0.335 | 0.738 |
| review_scores_location | 5.775 | 12.125 | 0.476 | 0.634 |
| review_scores_value | -38.161 | 13.996 | -2.727 | 0.006 |
| calculated_host_listings_count | -0.006 | 0.135 | -0.044 | 0.965 |
| reviews_per_month | -0.062 | 2.339 | -0.026 | 0.979 |
| host_tenure | 0.107 | 0.084 | 1.284 | 0.199 |
| since_first_review | 0.378 | 0.132 | 2.870 | 0.004 |
| since_last_review | -0.493 | 0.391 | -1.263 | 0.207 |
| num_bath | 126.016 | 3.093 | 40.739 | 0.000 |
| bath_typeshared | -22.623 | 12.523 | -1.807 | 0.071 |
| model | rmse | mae | rsq |
|---|---|---|---|
| reg_all_pred | 123.485 | 66.392 | 0.448 |
Here is the initial logistic regression model predicting value score category using all predictors.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1416.850 | 1357.460 | 1.044 | 0.297 |
| host_in_denveryes | 0.510 | 0.224 | 2.281 | 0.023 |
| host_response_timewithin a day | 2.111 | 3.480 | 0.607 | 0.544 |
| host_response_timewithin a few hours | 2.247 | 3.512 | 0.640 | 0.522 |
| host_response_timewithin an hour | 2.274 | 3.515 | 0.647 | 0.518 |
| host_response_rate | -0.019 | 0.019 | -0.981 | 0.327 |
| host_acceptance_rate | -0.005 | 0.006 | -0.920 | 0.357 |
| host_is_superhostyes | 0.456 | 0.164 | 2.786 | 0.005 |
| neighborhoodAuraria | -4.078 | 6.122 | -0.666 | 0.505 |
| neighborhoodBaker | 0.384 | 0.935 | 0.410 | 0.681 |
| neighborhoodBarnum | -1.911 | 1.285 | -1.487 | 0.137 |
| neighborhoodBarnum West | 2.187 | 1.484 | 1.474 | 0.140 |
| neighborhoodBear Valley | 16.619 | 1872.800 | 0.009 | 0.993 |
| neighborhoodBelcaro | 0.048 | 1.637 | 0.029 | 0.977 |
| neighborhoodBerkeley | 0.127 | 1.482 | 0.086 | 0.932 |
| neighborhoodCapitol Hill | -0.803 | 1.120 | -0.717 | 0.473 |
| neighborhoodCBD | 0.085 | 1.109 | 0.077 | 0.939 |
| neighborhoodChaffee Park | -2.151 | 1.746 | -1.232 | 0.218 |
| neighborhoodCheesman Park | 0.601 | 1.297 | 0.463 | 0.643 |
| neighborhoodCherry Creek | -0.155 | 1.314 | -0.118 | 0.906 |
| neighborhoodCity Park | -0.581 | 1.379 | -0.421 | 0.673 |
| neighborhoodCity Park West | -0.876 | 1.220 | -0.718 | 0.473 |
| neighborhoodCivic Center | -0.589 | 1.333 | -0.442 | 0.658 |
| neighborhoodClayton | -1.927 | 1.610 | -1.197 | 0.231 |
| neighborhoodCole | -1.342 | 1.484 | -0.904 | 0.366 |
| neighborhoodCollege View - South Platte | 16.104 | 1295.522 | 0.012 | 0.990 |
| neighborhoodCongress Park | -0.952 | 1.236 | -0.770 | 0.441 |
| neighborhoodCory - Merrill | 0.346 | 1.610 | 0.215 | 0.830 |
| neighborhoodCountry Club | -1.671 | 1.710 | -0.977 | 0.328 |
| neighborhoodDIA | -7.320 | 3.622 | -2.021 | 0.043 |
| neighborhoodEast Colfax | -4.387 | 1.791 | -2.449 | 0.014 |
| neighborhoodElyria Swansea | -1.774 | 1.996 | -0.889 | 0.374 |
| neighborhoodFive Points | -1.334 | 1.249 | -1.068 | 0.286 |
| neighborhoodFort Logan | 18.142 | 1600.953 | 0.011 | 0.991 |
| neighborhoodGateway - Green Valley Ranch | -8.075 | 3.264 | -2.474 | 0.013 |
| neighborhoodGlobeville | -1.721 | 1.833 | -0.939 | 0.348 |
| neighborhoodGoldsmith | -0.905 | 1.673 | -0.541 | 0.588 |
| neighborhoodHale | -1.124 | 1.474 | -0.763 | 0.446 |
| neighborhoodHampden | -0.871 | 2.047 | -0.425 | 0.671 |
| neighborhoodHampden South | 0.437 | 1.938 | 0.225 | 0.822 |
| neighborhoodHarvey Park | 0.971 | 2.218 | 0.438 | 0.662 |
| neighborhoodHarvey Park South | 16.568 | 1366.932 | 0.012 | 0.990 |
| neighborhoodHighland | -0.286 | 1.278 | -0.223 | 0.823 |
| neighborhoodHilltop | -2.359 | 1.818 | -1.298 | 0.194 |
| neighborhoodIndian Creek | -4.907 | 3.018 | -1.626 | 0.104 |
| neighborhoodJefferson Park | -0.270 | 1.200 | -0.225 | 0.822 |
| neighborhoodLincoln Park | 0.349 | 1.193 | 0.293 | 0.770 |
| neighborhoodLowry Field | -1.550 | 1.892 | -0.819 | 0.413 |
| neighborhoodMar Lee | 1.772 | 1.290 | 1.374 | 0.169 |
| neighborhoodMarston | -11.370 | 3956.181 | -0.003 | 0.998 |
| neighborhoodMontbello | -7.997 | 3.335 | -2.398 | 0.016 |
| neighborhoodMontclair | -1.634 | 1.804 | -0.906 | 0.365 |
| neighborhoodNorth Capitol Hill | 0.231 | 1.254 | 0.184 | 0.854 |
| neighborhoodNorth Park Hill | -2.812 | 1.633 | -1.722 | 0.085 |
| neighborhoodNortheast Park Hill | -2.368 | 1.790 | -1.323 | 0.186 |
| neighborhoodOverland | 1.196 | 1.100 | 1.087 | 0.277 |
| neighborhoodPlatt Park | 1.047 | 1.064 | 0.984 | 0.325 |
| neighborhoodRegis | -0.459 | 1.830 | -0.251 | 0.802 |
| neighborhoodRosedale | 1.960 | 1.336 | 1.467 | 0.142 |
| neighborhoodRuby Hill | 2.298 | 1.165 | 1.973 | 0.048 |
| neighborhoodSkyland | -1.831 | 1.478 | -1.238 | 0.216 |
| neighborhoodSloan Lake | 0.908 | 1.210 | 0.750 | 0.453 |
| neighborhoodSouth Park Hill | -3.842 | 1.635 | -2.349 | 0.019 |
| neighborhoodSpeer | 0.397 | 0.925 | 0.429 | 0.668 |
| neighborhoodStapleton | -3.920 | 2.213 | -1.772 | 0.076 |
| neighborhoodSun Valley | 10.727 | 3956.181 | 0.003 | 0.998 |
| neighborhoodSunnyside | -1.453 | 1.457 | -0.997 | 0.319 |
| neighborhoodUnion Station | -0.420 | 1.169 | -0.359 | 0.719 |
| neighborhoodUniversity | -0.688 | 1.240 | -0.555 | 0.579 |
| neighborhoodUniversity Hills | -0.518 | 1.601 | -0.324 | 0.746 |
| neighborhoodUniversity Park | 0.924 | 1.316 | 0.702 | 0.483 |
| neighborhoodValverde | -0.028 | 1.594 | -0.018 | 0.986 |
| neighborhoodVilla Park | 0.524 | 1.170 | 0.448 | 0.654 |
| neighborhoodVirginia Village | -1.397 | 1.414 | -0.988 | 0.323 |
| neighborhoodWashington Park | -0.656 | 1.235 | -0.531 | 0.595 |
| neighborhoodWashington Park West | -0.169 | 0.951 | -0.178 | 0.859 |
| neighborhoodWashington Virginia Vale | -1.448 | 1.488 | -0.973 | 0.330 |
| neighborhoodWellshire | 14.672 | 1409.739 | 0.010 | 0.992 |
| neighborhoodWest Colfax | 0.583 | 1.062 | 0.549 | 0.583 |
| neighborhoodWest Highland | 0.468 | 1.335 | 0.350 | 0.726 |
| neighborhoodWestwood | -0.362 | 1.118 | -0.324 | 0.746 |
| neighborhoodWhittier | -1.141 | 1.408 | -0.810 | 0.418 |
| neighborhoodWindsor | -3.385 | 2.818 | -1.202 | 0.230 |
| latitude | 25.193 | 16.280 | 1.547 | 0.122 |
| longitude | 22.499 | 11.474 | 1.961 | 0.050 |
| room_typeHotel room | 4.478 | 811.739 | 0.006 | 0.996 |
| room_typePrivate room | -0.510 | 0.297 | -1.718 | 0.086 |
| room_typeShared room | -1.304 | 1.403 | -0.930 | 0.352 |
| price | 0.000 | 0.001 | 0.270 | 0.787 |
| min_nights | 0.003 | 0.003 | 0.831 | 0.406 |
| max_nights | 0.000 | 0.000 | 1.673 | 0.094 |
| number_of_reviews | 0.134 | 0.023 | 5.934 | 0.000 |
| number_of_reviews_ltm | 0.018 | 0.028 | 0.643 | 0.520 |
| review_scores_rating | -4.533 | 0.688 | -6.586 | 0.000 |
| review_scores_accuracy | -2.686 | 0.612 | -4.386 | 0.000 |
| review_scores_cleanliness | -0.747 | 0.398 | -1.876 | 0.061 |
| review_scores_checkin | 0.189 | 0.496 | 0.381 | 0.704 |
| review_scores_communication | -0.986 | 0.520 | -1.895 | 0.058 |
| review_scores_location | -2.817 | 0.474 | -5.946 | 0.000 |
| calculated_host_listings_count | 0.001 | 0.004 | 0.185 | 0.853 |
| reviews_per_month | -0.054 | 0.089 | -0.612 | 0.541 |
| host_tenure | 0.000 | 0.003 | -0.144 | 0.886 |
| since_first_review | 0.019 | 0.007 | 2.568 | 0.010 |
| since_last_review | -0.047 | 0.011 | -4.195 | 0.000 |
| num_bath | -0.007 | 0.113 | -0.064 | 0.949 |
| bath_typeshared | 0.267 | 0.414 | 0.644 | 0.519 |
| model | accuracy | sensitivity | specificity | auc |
|---|---|---|---|---|
| log_all_pred | 0.912 | 0.776 | 0.946 | 0.96 |
Here is the lasso regression model which narrowed down the number of predictors used to predict price.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 167.845 | 2.733 | 61.404 | 0.000 |
| host_response_rate | -0.830 | 2.918 | -0.284 | 0.776 |
| longitude | -3.719 | 4.437 | -0.838 | 0.402 |
| min_nights | -4.873 | 3.004 | -1.622 | 0.105 |
| max_nights | 4.943 | 2.735 | 1.807 | 0.071 |
| review_scores_rating | 6.235 | 2.849 | 2.189 | 0.029 |
| since_first_review | 4.348 | 2.768 | 1.571 | 0.116 |
| num_bath | 103.019 | 2.842 | 36.252 | 0.000 |
| host_in_denver_yes | 4.591 | 3.049 | 1.506 | 0.132 |
| neighborhood_Auraria | -1.862 | 2.793 | -0.667 | 0.505 |
| neighborhood_Belcaro | 2.743 | 5.098 | 0.538 | 0.591 |
| neighborhood_Berkeley | -3.407 | 2.965 | -1.149 | 0.251 |
| neighborhood_Capitol.Hill | 3.345 | 2.690 | 1.243 | 0.214 |
| neighborhood_CBD | 10.796 | 2.979 | 3.624 | 0.000 |
| neighborhood_Cheesman.Park | -1.260 | 2.765 | -0.456 | 0.649 |
| neighborhood_City.Park.West | -5.008 | 2.889 | -1.733 | 0.083 |
| neighborhood_Civic.Center | -6.908 | 2.971 | -2.325 | 0.020 |
| neighborhood_Cole | 13.665 | 2.641 | 5.174 | 0.000 |
| neighborhood_Gateway…Green.Valley.Ranch | -7.447 | 3.944 | -1.888 | 0.059 |
| neighborhood_Goldsmith | 2.315 | 2.441 | 0.949 | 0.343 |
| neighborhood_Highland | 5.981 | 2.811 | 2.128 | 0.033 |
| neighborhood_Indian.Creek | NA | NA | NA | NA |
| neighborhood_Lowry.Field | -3.705 | 3.962 | -0.935 | 0.350 |
| neighborhood_Mar.Lee | -3.379 | 2.896 | -1.167 | 0.243 |
| neighborhood_Overland | -2.796 | 3.095 | -0.903 | 0.366 |
| neighborhood_Stapleton | -5.636 | 2.723 | -2.070 | 0.039 |
| neighborhood_Union.Station | 6.707 | 2.745 | 2.444 | 0.015 |
| neighborhood_Washington.Park.West | -2.449 | 2.732 | -0.896 | 0.370 |
| neighborhood_Washington.Virginia.Vale | -2.939 | 2.735 | -1.075 | 0.283 |
| neighborhood_West.Highland | 3.030 | 2.951 | 1.027 | 0.305 |
| neighborhood_Whittier | 0.641 | 2.822 | 0.227 | 0.820 |
| room_type_Hotel.room | -1.168 | 3.512 | -0.333 | 0.740 |
| room_type_Shared.room | -1.315 | 2.949 | -0.446 | 0.656 |
| bath_type_shared | -10.983 | 2.992 | -3.671 | 0.000 |
| model | rmse | mae | rsq |
|---|---|---|---|
| reg_all_pred | 123.485 | 66.392 | 0.448 |
| all_predictor_lasso | 126.494 | 64.499 | 0.406 |
The final equation for the lasso regression is: price ~ min_nights + max_nights + review_scores_rating + num_bath + neighborhood_CBD + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway…Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 168.213 | 2.725 | 61.729 | 0.000 |
| min_nights | -6.955 | 2.748 | -2.531 | 0.011 |
| max_nights | 5.629 | 2.687 | 2.095 | 0.036 |
| review_scores_rating | 6.200 | 2.827 | 2.193 | 0.028 |
| num_bath | 102.279 | 2.820 | 36.270 | 0.000 |
| neighborhood_CBD | 11.250 | 2.839 | 3.962 | 0.000 |
| neighborhood_Civic.Center | -6.520 | 2.966 | -2.198 | 0.028 |
| neighborhood_Cole | 13.516 | 2.619 | 5.160 | 0.000 |
| neighborhood_Gateway…Green.Valley.Ranch | -9.793 | 2.734 | -3.582 | 0.000 |
| neighborhood_Highland | 6.839 | 2.696 | 2.536 | 0.011 |
| neighborhood_Stapleton | -6.061 | 2.532 | -2.394 | 0.017 |
| neighborhood_Union.Station | 6.557 | 2.696 | 2.432 | 0.015 |
| bath_type_shared | -10.660 | 2.795 | -3.813 | 0.000 |
| model | rmse | mae | rsq |
|---|---|---|---|
| reg_all_pred | 123.485 | 66.392 | 0.448 |
| all_predictor_lasso | 126.494 | 64.499 | 0.406 |
| reg_lasso | 125.436 | 65.756 | 0.405 |
The lasso model yielded all significant predictors so no backward elimination process was needed.
The final equation for the lasso regression is: value_score_cat ~ number_of_reviews_ltm + review_scores_rating + since_first_review
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.908 | 0.212 | 18.424 | 0 |
| number_of_reviews_ltm | 3.614 | 0.305 | 11.839 | 0 |
| review_scores_rating | -2.292 | 0.251 | -9.144 | 0 |
| since_first_review | 1.290 | 0.119 | 10.869 | 0 |
| Measure | Value |
|---|---|
| R-square | 0.464 |
Truth
Prediction excellent other
excellent 128 55
other 53 680
| model | accuracy | sensitivity | specificity | auc |
|---|---|---|---|---|
| log_all_pred | 0.912 | 0.776 | 0.946 | 0.96 |
| log_lasso_full | 0.882 | 0.707 | 0.925 | 0.92 |
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
price ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = 5
trees = 500
min_n = 20
Engine-Specific Arguments:
importance = impurity
max.depth = 8
Computational engine: ranger
| model | rmse | mae | rsq |
|---|---|---|---|
| reg_all_pred | 123.485 | 66.392 | 0.448 |
| all_predictor_lasso | 126.494 | 64.499 | 0.406 |
| reg_lasso | 125.436 | 65.756 | 0.405 |
| reg_full_rf | 151.885 | 75.146 | 0.409 |
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
price ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = 5
trees = 1000
min_n = 10
Engine-Specific Arguments:
importance = impurity
max.depth = 8
Computational engine: ranger
| model | rmse | mae | rsq |
|---|---|---|---|
| reg_all_pred | 123.485 | 66.392 | 0.448 |
| all_predictor_lasso | 126.494 | 64.499 | 0.406 |
| reg_lasso | 125.436 | 65.756 | 0.405 |
| reg_full_rf | 151.885 | 75.146 | 0.409 |
| reg_lasso_rf | 135.934 | 67.762 | 0.484 |
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
value_score_cat ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = 4
trees = 500
min_n = 10
Engine-Specific Arguments:
importance = impurity
max.depth = 8
Computational engine: ranger
Truth
Prediction excellent other
excellent 105 22
other 76 713
| model | accuracy | sensitivity | specificity | auc |
|---|---|---|---|---|
| log_all_pred | 0.912 | 0.776 | 0.946 | 0.96 |
| log_lasso_full | 0.882 | 0.707 | 0.925 | 0.92 |
| cat_full_rf | 0.893 | 0.580 | 0.970 | 0.95 |
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
value_score_cat ~ .
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = 2
trees = 1000
min_n = 20
Engine-Specific Arguments:
importance = impurity
max.depth = 5
Computational engine: ranger
Truth
Prediction excellent other
excellent 135 41
other 46 694
| model | accuracy | sensitivity | specificity | auc |
|---|---|---|---|---|
| log_all_pred | 0.912 | 0.776 | 0.946 | 0.96 |
| log_lasso_full | 0.882 | 0.707 | 0.925 | 0.92 |
| cat_full_rf | 0.893 | 0.580 | 0.970 | 0.95 |
| cat_lasso_rf | 0.905 | 0.746 | 0.944 | 0.96 |
| model | rmse | mae | rsq |
|---|---|---|---|
| reg_all_pred | 123.485 | 66.392 | 0.448 |
| all_predictor_lasso | 126.494 | 64.499 | 0.406 |
| reg_lasso | 125.436 | 65.756 | 0.405 |
| reg_full_rf | 151.885 | 75.146 | 0.409 |
| reg_lasso_rf | 135.934 | 67.762 | 0.484 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 168.213 | 2.725 | 61.729 | 0.000 |
| min_nights | -6.955 | 2.748 | -2.531 | 0.011 |
| max_nights | 5.629 | 2.687 | 2.095 | 0.036 |
| review_scores_rating | 6.200 | 2.827 | 2.193 | 0.028 |
| num_bath | 102.279 | 2.820 | 36.270 | 0.000 |
| neighborhood_CBD | 11.250 | 2.839 | 3.962 | 0.000 |
| neighborhood_Civic.Center | -6.520 | 2.966 | -2.198 | 0.028 |
| neighborhood_Cole | 13.516 | 2.619 | 5.160 | 0.000 |
| neighborhood_Gateway…Green.Valley.Ranch | -9.793 | 2.734 | -3.582 | 0.000 |
| neighborhood_Highland | 6.839 | 2.696 | 2.536 | 0.011 |
| neighborhood_Stapleton | -6.061 | 2.532 | -2.394 | 0.017 |
| neighborhood_Union.Station | 6.557 | 2.696 | 2.432 | 0.015 |
| bath_type_shared | -10.660 | 2.795 | -3.813 | 0.000 |
| model | accuracy | sensitivity | specificity | auc |
|---|---|---|---|---|
| log_all_pred | 0.912 | 0.776 | 0.946 | 0.96 |
| log_lasso_full | 0.882 | 0.707 | 0.925 | 0.92 |
| cat_full_rf | 0.893 | 0.580 | 0.970 | 0.95 |
| cat_lasso_rf | 0.905 | 0.746 | 0.944 | 0.96 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.908 | 0.212 | 18.424 | 0 |
| number_of_reviews_ltm | 3.614 | 0.305 | 11.839 | 0 |
| review_scores_rating | -2.292 | 0.251 | -9.144 | 0 |
| since_first_review | 1.290 | 0.119 | 10.869 | 0 |
Most Proud
I am most proud of my data cleaning and variable transformations. I worked hard to make sure that all of the necessary transformations were done in R so that I could import the data straight in from the InsideAirbnb.com website at any time. Some of the code is clunky (I used a massive case statement to recode the bathroom variable), but it works!
If I Had More Time
If I had another week I would try a log transformation on the price variable. It’s just so skewed. I don’t think it would change the fact that number of bathrooms (property size, really) is the dominant factor for price, but it might allow for more insight about what else matters.
For the categorical variable, I would want to try some regular trees using all the predictors. It’s interesting that the overall customer rating is negatively correlated with the value score rating. I’m curious if other score ratings would be positively or negatively correlated to it.
---
title: "Kabira Project"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: scroll
source_code: embed
theme: yeti
---
```{r setup, include=FALSE,warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(GGally) #v2.1.2
library(ggcorrplot) #v0.1.3
library(MASS) #v7.3-54 for Boston data
library(flexdashboard) #v0.5.2
library(plotly) #v4.10.1
library(crosstalk) #v1.2.0
library(tidymodels)
#library(dplyr) #v1.0.7 %>%, select(), select_if(), filter(), mutate(), group_by(),
#summarize(), tibble()
#library(ggplot2) #v3.3.5 ggplot()
library(ISLR) #v1.4 Default, Auto dataset
library(themis) #v1.0.0 step_smote
library(tidymodels)
library(parsnip) #v1.0.3 linear_reg(), discrim_regularized(), set_engine(), set_mode(), fit(), predict()
library(yardstick) #v1.1.0 metrics(), roc_auc(), roc_curve(), metric_set(), conf_matrix()
library(dplyr) #v1.0.10 %>%, select(), select_if(), filter(), mutate(), group_by(),
#summarize(), tibble()
#library(ggplot2) #v3.4.0 ggplot()
#library(broom) #v1.0.2 for tidy(), augment(), glance()
#library(rsample) #v1.1.1 initial_split(), training(), testing()
library(readr) #v2.1.3 read_csv()
library(knitr) #v1.41 kable()
library(stringr)
theme_set(theme_bw()) #sets default ggplot output style
library(boot) #1.3-28.1 boot()
library(discrim) #v1.0.0 discriminant analysis wrapper
library(janitor) #v2.1.0 clean_names()
library(vip) #0.3.2 vip() (variable importance)
library(glmnet) #v4.1-6 for ridge/lasso regression
library(skimr) #v2.1.5
```
```{r load_data}
# read in compressed file
abb1 <- read_csv("listings.csv.gz")
# for this analysis, I did a lot of recoding of the raw file downloaded from the website
# 1. I eliminated the variables that described the property, as well as the property's id and the host's id
# 2. I recoded TRUE/FALSE (lgl type) variables into factor variables (fctr type)
# 3. I recoded variables that had % or $ signs into numeric values (dbl type)
# 4. I recoded variables that had a limited number of choices (for instance the neighborhood's name) into factor variables
# 5. I eliminated date variables (date type) and made new numeric variables that list the number of elapsed months from the date until 2022.12.31 (the date that the data was scraped from the website)
# 6. I eliminated a variable that included both the number and type of baths and put the number of baths into one new numeric variable and the type of baths into a factor variable
# 7. I created a new (factor type) categorical variable for the purposes of doing a model to predict a categorical variable: if the review_scores_value was 5.0 (the highest possible rating), I coded it as 'excellent'; all other values were coded as 'other'.
# 8. I deleted all records that had null values. This reduced the number of records from 5250 to 3050.
abb1 <- abb1 %>%
mutate(host_is_superhost = ifelse(host_is_superhost == TRUE, "yes", "no"),
host_in_denver = ifelse(host_location == "Denver, CO", "yes", "no"),
neighborhood = neighbourhood_cleansed, max_guests = accommodates,
min_nights = minimum_nights_avg_ntm, max_nights = maximum_nights_avg_ntm)
abb1 <- abb1 %>%
select(id, name, host_id, host_name, host_since, host_in_denver, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, neighborhood, latitude, longitude, room_type, max_guests, bathrooms_text, bedrooms, beds, price, min_nights, max_nights, number_of_reviews, number_of_reviews_ltm, first_review, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, calculated_host_listings_count, reviews_per_month)
abb1 <- abb1 %>%
mutate(price = ifelse(price == 'N/A' | price== '$0.00', 0, price)) %>%
mutate(host_response_rate = ifelse(host_response_rate== 'N/A','0%',host_response_rate)) %>%
mutate(host_acceptance_rate = ifelse(host_acceptance_rate== 'N/A', '0%', host_acceptance_rate)) %>%
mutate(id = as.character(id), host_id = as.character(host_id)) %>%
mutate(host_response_time = ifelse(host_response_time== 'N/A', NA, host_response_time)) %>%
mutate(neighborhood = ifelse(neighborhood == 'Kennedy' | neighborhood == 'Southmoor Park', NA, neighborhood))
abb1 <- abb1 %>%
mutate(price = as.numeric(gsub("[\\$,]", "", price)))%>%
mutate(host_response_rate = as.numeric(gsub("%$", "", host_response_rate)))%>%
mutate(host_acceptance_rate = as.numeric(gsub("%$", "", host_acceptance_rate))) %>%
mutate(host_response_time = as.factor(host_response_time)) %>%
mutate(room_type = as.factor(room_type)) %>%
mutate(neighborhood = as.factor(neighborhood)) %>%
mutate(host_in_denver = as.factor(host_in_denver)) %>%
mutate(host_is_superhost = as.factor(host_is_superhost)) %>%
mutate(price = ifelse(price > 0 & price < 3000, price, NA)) %>%
mutate(host_response_rate = ifelse(host_response_rate > 0, host_response_rate, NA)) %>%
mutate(host_acceptance_rate = ifelse(host_acceptance_rate >0, host_acceptance_rate, NA)) %>%
mutate(host_tenure = as.numeric(difftime("2022-12-31", abb1$host_since, units = "days"))/(365.25/12),
since_first_review = as.numeric(difftime("2022-12-31", abb1$first_review, units = "days"))/(365.25/12),
since_last_review = as.numeric(difftime("2022-12-31", abb1$last_review, units = "days"))/(365.25/12)) %>%
na.omit()
abb1 <- abb1 %>%
mutate(num_bath = case_when(bathrooms_text == 'Shared half-bath' ~ .5,
bathrooms_text == '1 bath' ~ 1,
bathrooms_text == '1 shared bath' ~ 1,
bathrooms_text == '1 private bath' ~ 1,
bathrooms_text == '1.5 baths' ~ 1.5,
bathrooms_text == '1.5 shared baths' ~ 1.5,
bathrooms_text == '2 baths' ~ 2,
bathrooms_text == '2 shared baths' ~ 2,
bathrooms_text == '2.5 baths' ~ 2.5,
bathrooms_text == '2.5 shared baths' ~ 2.5,
bathrooms_text == '3 baths' ~ 3,
bathrooms_text == '3 shared baths' ~ 3,
bathrooms_text == '3.5 baths' ~ 3.5,
bathrooms_text == '3.5 shared baths' ~ 3.5,
bathrooms_text == '4 baths' ~ 4,
bathrooms_text == '4 shared baths' ~ 4,
bathrooms_text == '4.5 baths' ~ 4.5,
bathrooms_text == '4.5 shared baths' ~ 4.5,
bathrooms_text == '5 baths' ~ 5,
bathrooms_text == '5 shared baths' ~ 5,
bathrooms_text == '5.5 baths' ~ 5.5,
bathrooms_text == '5.5 shared baths' ~ 5.5,
bathrooms_text == '6 baths' ~ 6,
bathrooms_text == '6 shared baths' ~ 6,
bathrooms_text == '6.5 baths' ~ 6.5,
bathrooms_text == '6.5 shared baths' ~ 6.5,
bathrooms_text == '7 baths' ~ 7,
bathrooms_text == '7 shared baths' ~ 7,
bathrooms_text == '7.5 baths' ~ 7.5,
bathrooms_text == '7.5 shared baths' ~ 7.5,
bathrooms_text == '8 baths' ~ 8,
bathrooms_text == '8 shared baths' ~ 8,
bathrooms_text == '8.5 baths' ~ 8.5,
bathrooms_text == '8.5 shared baths' ~ 8.5,
bathrooms_text == '9 baths' ~ 9,
bathrooms_text == '9 shared baths' ~ 9,
bathrooms_text == '9.5 baths' ~ 9.5,
bathrooms_text == '9.5 shared baths' ~ 9.5)) %>%
mutate(bath_type =
as.factor(case_when(bathrooms_text == 'Shared half-bath' ~ 'shared',
bathrooms_text == '1 bath' ~ 'private',
bathrooms_text == '1 shared bath' ~ 'shared',
bathrooms_text == '1 private bath' ~ 'private',
bathrooms_text == '1.5 baths' ~ 'private',
bathrooms_text == '1.5 shared baths' ~ 'shared',
bathrooms_text == '2 baths' ~ 'private',
bathrooms_text == '2 shared baths' ~ 'shared',
bathrooms_text == '2.5 baths' ~ 'private',
bathrooms_text == '2.5 shared baths' ~ 'shared',
bathrooms_text == '3 baths' ~ 'private',
bathrooms_text == '3 shared baths' ~ 'shared',
bathrooms_text == '3.5 baths' ~ 'private',
bathrooms_text == '3.5 shared baths' ~ 'shared',
bathrooms_text == '4 baths' ~ 'private',
bathrooms_text == '4 shared baths' ~ 'shared',
bathrooms_text == '4.5 baths' ~ 'private',
bathrooms_text == '4.5 shared baths' ~ 'shared',
bathrooms_text == '5 baths' ~ 'private',
bathrooms_text == '5 shared baths' ~ 'shared',
bathrooms_text == '5.5 baths' ~ 'private',
bathrooms_text == '5.5 shared baths' ~ 'shared',
bathrooms_text == '6 baths' ~ 'private',
bathrooms_text == '6 shared baths' ~ 'shared',
bathrooms_text == '6.5 baths' ~ 'private',
bathrooms_text == '6.5 shared baths' ~ 'shared',
bathrooms_text == '7 baths' ~ 'private',
bathrooms_text == '7 shared baths' ~ 'shared',
bathrooms_text == '7.5 baths' ~ 'private',
bathrooms_text == '7.5 shared baths' ~ 'shared',
bathrooms_text == '8 baths' ~ 'private',
bathrooms_text == '8 shared baths' ~ 'shared',
bathrooms_text == '8.5 baths' ~ 'private',
bathrooms_text == '8.5 shared baths' ~ 'shared',
bathrooms_text == '9 baths' ~ 'private',
bathrooms_text == '9 shared baths' ~ 'shared',
bathrooms_text == '9.5 baths' ~ 'private',
bathrooms_text == '9.5 shared baths' ~ 'shared')))
abb1 <- abb1 %>%
mutate(value_score_cat = as.factor(if_else(abb1$review_scores_value == 5.00, "excellent", "other")))
abb1 <- abb1 %>%
select(., - bathrooms_text, -id, -name, -host_id, -host_name, -host_since, -first_review, -last_review)
# creating two new datasets, one for continuous predictor and one for categorical predictor
abb_reg <- abb1 %>%
select(., -value_score_cat)
abb_cat <- abb1 %>%
select(., -review_scores_value)
```
Exec. Summary {data-orientation=rows}
=======================================================================
Row {data-height=400}
-----------------------------------------------------------------------
### Executive Summary
#### **This project examined two questions:**
1. What factors influence the price per night of an airbnb property in Denver, Colorado?
2. What factors influence a property receiving a perfect score for value?
#### **Data**
This is a dataset from a website called InsideAirbnb.com. It is a dataset that was scraped on December 30 and 31, 2022 from Airbnb for the previous 12 months’ listings for Denver, CO.
The original dataset has 3050 rows and 32 variables. The variables were all normalized because of their vast differences in scale (some variables had values in the thousands; others had a maximum value of 5). The number of normalized variables was 105.
#### **Methodology**
Analyses were done to predict two variables. Price was the continuous variable, and value_score_cat was the categorical variable. There were two values for value_score_cat. A perfect score of 5.0 was coded as "excellent”, and anything else was coded as “other”.
The following methodology was undertaken for each of the variables:
• Examine the distribution of the variables to look for relationships
• Conduct simple regression analyses for baselines
• Normalize all variables
• Employ lasso regression to narrow down the number of variables
• Build one random forest model with all predictors and a second one with only the lasso predictors
Column {data-width=400, data-height=400}
-----------------------------------------------------------------------
#### **Conclusions**
### *Price*
Number of baths ended up being the most important predictor of price, by a huge margin. Number of baths is a good proxy for property size: more baths suggests larger property. Price being higher for a larger property makes sense.
Looking at the analyses as a whole, the three most important factors that a host could change are minimum nights, review scores rating, and maximum nights. We see from the regression coefficients that having a lower number of minimum nights, a higher number of maximum nights, and a higher review scores rating all increase price. The recommendations for a host would be to be flexible (allow for both shorter and longer stays) and to work hard to please the customers.
### *Value Score*
Review scores rating is the most important predictor of value score.
Interestingly, review scores rating is **negatively** correlated with value score. This would mean that higher-rated properties overall are rated as having lower value for the price. The other two significant predictors from the lasso regression are the number of reviews in the last 12 months and the number of months since the first review of the property. These are both positively correlated to value score, which means that properties that have been airbnbs longer and which have had more customers recently receive better ratings for value.
The takeaway from the negative correlation between review scores rating and value score may be that, even though customers have a great overall experience at an airbnb, they may believe that it was overpriced.
More research should be done before offering airbnb hosts advice on this matter, however.
Introduction {data-orientation=rows}
=======================================================================
Row {data-height=1250}
-----------------------------------------------------------------------
### The Project
#### The Problem Description
This is a dataset from a website called InsideAirbnb.com. It is a dataset that was scraped on December 30 and 31, 2022 from Airbnb for the previous 12 months’ listings for Denver, CO. The website is http://insideairbnb.com/get-the-data. Variables from the downloaded detailed dataset have been transformed in R in order to conduct the analyses. The distribution of variables will be examined first in order to look for relationships. Regression analysis will be performed in order to predict a property's per-night price. Following normalization of the variables, lasso regression will be employed to narrow down the number of variables, as well. One random forest model will be run with all predictors and a second one with the lasso predictors. A classification analysis will be used to predict a customer's rating of the value of the property (whether the property is a good value for the price paid). Because the ratings skew high, there were two ratings assigned: 'excellent' for a perfect rating and 'other' for anything less. Lasso regression will again be used in order to narrow down the number of predictors, followed by logistic regression. One random forest model will be run with all predictors and a second one with the lasso predictors. Finally, a summary of conclusions will be presented.
#### The Data
This dataset has 3050 rows and 32 variables.
#### Data Sources
This is a dataset from a website called InsideAirbnb.com. It is a dataset that was scraped on December 30 and 31, 2022 from Airbnb for the previous 12 months’ listings for Denver, CO. The website is http://insideairbnb.com/get-the-data.
### The Data
VARIABLES TO PREDICT WITH
* **host_in_denver**: whether or not the host lives in Denver
* **host_response_time**: how quickly a host responds to messages
* **host_response_rate**: the percentage of requests to which a host has responded
* **host_acceptance_rate**: the percentage of booking requests that a host accepts
* **host_is_superhost**: whether a host has earned the Airbnb distinction “Superhost”
* **neighborhood**: neighborhood name of the property’s location
* **latitude**: latitude of property listing
* **longitude**: longitude of property listing
* **room_type**: whether the entire home/apartment is being rented or if it is a private room
* **max_guests**: maximum number of guests
* **bedrooms**: number of bedrooms
* **beds**: total number of beds
* **price**: per/night price (in US$) - this is a predictor for the classification analysis
* **min_nights**: minimum number of consecutive nights that the property may be rented
* **max_nights**: maximum number of consecutive nights that the property may be rented
* **number_of_reviews**: total number of reviews that a listing has
* **number_of_reviews_ltm**: number of reviews that a listing has had in the last twelve months
* **review_scores_rating**: overall average rating for the property
* **review_scores_accuracy**: average rating for the accuracy of the property’s description
* **review_scores_cleanliness**: average rating for the cleanliness of the property
* **review_scores_checkin**: average rating for the ease of check-in
* **review_scores_communication**: average rating for the host’s communication
* **review_scores_location**: average rating for the property’s location
* **review_scores_value**: average rating for the renters’ assessments of value of the rental experience for the price
* **calculated_host_listings_count**: number of listings that a host has
* **reviews_per_month**: number of total reviews / the number of months that the property has been listed
* **host_tenure**: number of months that the host has been an Airbnb member
* **since_first_review**: number of months since the first review
* **since_last_review**: number of months since the most recent review
* **num_bath**: number of bathrooms available to guests
* **bath_type**: whether the bathroom(s) are shared or private
VARIABLES WE WANT TO PREDICT
* **price**: per/night price (in US$)
* **value_score_cat**: value calculated from the review_scores_value – if the review_scores_value is 5.0, the property is rated as ‘excellent’, if not, it is rated ‘other’
Explorations {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=350}
-------------------------------------
### Data Overview
From this data we can see that some of the variables have very wide ranges (since_first_review, for instance) and others very narrow ones with some extreme values (min_nights and beds, for instance). Review scores for each category (except for review_scores_value) all have median values of at least 4.9, but the mean value for every one of the review scores is lower than the median value.
Looking at the average price by max_guests table, we see that - as expected - the mean price tends to increase with the maximum number of guests. There is one surprise: the max_guests of 15 has a mean price far below the values for 14 and 16 guests. However, the n for 15 max_guests is low so a single low value could skew the mean dramatically.
Column {data-width=450, data-height=1700}
-----------------------------------------------------------------------
### View the Data Summaries
Here are the ranges of values for each variable.
```{r, cache=TRUE}
#View data
summary(abb1)
```
Column {data-width=150, data-height=600}
-----------------------------------------------------------------------
### Average Price by `max_guests` (Maximum number of guests at one time)
```{r, cache=TRUE}
knitr::kable(abb1 %>%
group_by(max_guests) %>%
summarize(n=n(), mean(price)),digits=2)
```
Visualizations {data-orientation=rows}
=======================================================================
### Response Variables relationships with predictors
* Unsurprisingly, we see that the price values are very right-skewed. Although the median price is $125, the maximum price is $2614. Of the continuous predictors, max_guests, beds, bedrooms, and num_bath have the highest correlations with price. That would be logical, since a property that can accommodate more guests would likely command a higher price.
* We see that the 'excellent' value score category makes up approximately 20% of the total. One interesting finding is that non-superhosts outnumber superhosts in the 'excellent' value score category, even though the percentage of superhosts is much higher than the percentage of hosts who are not superhosts.
* After finding collinearity between beds, bedrooms, num_baths, and max_guests, individual regressions were run between price and the four collinear variables in order to select the one with the largest coefficient. Num_bath, having the highest coefficient of 127.8, was retained in the dataset. The other three variables were eliminated.
row {data-height=550}
-----------------------------------------------------------------------
#### Value Score
```{r, cache=TRUE}
ggplot(abb1,aes(x=value_score_cat)) + geom_bar()
```
#### Price
```{r, cache=TRUE}
ggplot(abb1, aes(price)) + geom_histogram(bins=15)
```
Row {.tabset data-height=480}
-----------------------------------------------------------------------
### Price vs Continuous Variables #1
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1,price,host_response_rate, host_acceptance_rate, latitude, longitude, max_guests)))
```
### Price vs Continuous Variables #2
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1,price,bedrooms, beds, price, min_nights, max_nights)))
```
### Price vs Continuous Variables #3
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1,price,number_of_reviews, number_of_reviews_ltm, review_scores_rating, review_scores_accuracy, review_scores_cleanliness)))
```
### Price vs Continuous Variables #4
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1,price,review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, calculated_host_listings_count)))
```
### Price vs Continuous Variables #5
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(abb1,price,reviews_per_month, host_tenure, since_first_review, since_last_review, num_bath)))
```
### Checking Correlations Between Property Size Proxy Variables
```{r, cache=TRUE}
ggpairs(dplyr::select(abb1, max_guests, num_bath, bedrooms, beds), progress=FALSE)
```
### Checking Regressions Individually Between Price and Property Size Proxy Variables
```{r}
reg_spec <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
reg_max_guests_fit <- reg_spec %>%
fit(price ~ max_guests, data = abb_reg)
tidy(reg_max_guests_fit$fit) %>%
kable(digits=3)
```
```{r}
reg_num_bath_fit <- reg_spec %>%
fit(price ~ num_bath, data = abb_reg)
tidy(reg_num_bath_fit$fit) %>%
kable(digits=3)
```
```{r}
reg_bedrooms_fit <- reg_spec %>%
fit(price ~ bedrooms, data = abb_reg)
tidy(reg_bedrooms_fit$fit) %>%
kable(digits=3)
```
```{r}
reg_beds_fit <- reg_spec %>%
fit(price ~ beds, data = abb_reg)
tidy(reg_beds_fit$fit) %>%
kable(digits=3)
```
### Value Score vs Host Is Superhost
```{r, cache=TRUE}
abb1 %>% group_by(host_is_superhost, value_score_cat) %>%
summarize(n=n()) %>%
ggplot(aes(y=n, x=value_score_cat,fill=host_is_superhost)) +
geom_bar(position="dodge", stat="identity") +
geom_text(aes(label=n), position=position_dodge(width=0.9), vjust=-0.25) +
ggtitle("Value Score vs Host Is Superhost") +
coord_flip() #makes horizontal
```
```{r}
# eliminate collinear variables (beds, bedrooms, max_guests)
abb_reg <- abb_reg %>%
dplyr::select(., -beds, -bedrooms, -max_guests)
abb_cat <- abb_cat %>%
dplyr::select(., -beds, -bedrooms, -max_guests)
```
Initial Models {data-orientation=rows}
=======================================================================
Row
-----------------------------------------------------------------------
### Baseline Models
Linear and logistic regression models were run as baseline models. All predictors were used with these initial models.
We can see from the actual-predicted plot for the linear regression that there are a number of price outliers that are likely causing problems with predictions. Given the large number of outliers, the R-square value of .448 is not surprising.
In contrast, the logistic regression seems to be fairly accurate overall, with accuracy of .912 and an AUC of .96. It is achieving this accuracy with a high specificity of .946 and a rather lower sensitivity of .776.
-----------------------------------------------------------------------
Row{data-height=2000, column-width = 700, .tabset .tabset-fade}
-----------------------------------------------------------------------
### Predicting Price
Here is the initial regression model predicting price using all predictors.
```{r}
reg_all_pred_fit <- reg_spec %>%
fit(price ~ ., data = abb_reg)
tidy(reg_all_pred_fit$fit) %>%
kable(digits=3)
my_reg_metrics = metric_set(yardstick::rmse,
yardstick::mae,
yardstick::rsq)
pred_reg_all_pred_fit <-reg_all_pred_fit %>%
augment(abb_reg)
curr_reg_metrics <- pred_reg_all_pred_fit %>%
my_reg_metrics(truth=price, estimate = .pred)
results_abb_reg <-tibble(model = 'reg_all_pred',
rmse = curr_reg_metrics[[1,3]],
mae = curr_reg_metrics[[2,3]],
rsq = curr_reg_metrics[[3,3]])
```
#### The Full Regression Model Metrics
```{r}
results_abb_reg %>%
kable(digits=3)
```
#### Actual vs Predicted Graph
```{r, cache=TRUE}
pred_reg_all_pred_fit %>%
ggplot(aes(x=price, y=.pred)) +
geom_point(col = "#6e0000") +
geom_abline(col="gold") +
ggtitle("Predicted Price vs Actual Price")
```
### Predicting Value Score Category
Here is the initial logistic regression model predicting value score category using all predictors.
```{r}
log_spec <- logistic_reg() %>%
set_engine('glm') %>%
set_mode('classification')
log_all_pred_fit <- log_spec %>%
fit(value_score_cat ~ ., data = abb_cat)
tidy(log_all_pred_fit$fit) %>%
kable(digits=3)
pred_log_all_pred <- log_all_pred_fit %>%
augment(abb_cat)
my_abb_cat_metrics <- metric_set(yardstick::accuracy,yardstick::sensitivity,
yardstick::specificity)
abb_cat_curr_metrics <- pred_log_all_pred %>%
my_abb_cat_metrics(truth = value_score_cat, estimate = .pred_class)
curr_auc <- pred_log_all_pred %>%
roc_auc(truth = value_score_cat, estimate = .pred_excellent) %>%
pull(.estimate)
results_abb_cat <- tibble(model = 'log_all_pred',
accuracy = abb_cat_curr_metrics[[1,3]],
sensitivity = abb_cat_curr_metrics[[2,3]],
specificity = abb_cat_curr_metrics[[3,3]],
auc = round(curr_auc,2))
```
#### The Full Classification Model Metrics
```{r}
results_abb_cat %>%
kable(digits = 3)
```
```{r}
# ROC
#Capture the thresholds and sens/spec
abb_cat_roc <- pred_log_all_pred %>%
roc_curve(truth = value_score_cat, estimate=.pred_excellent) %>%
mutate(model = paste('log_all_pred',round(curr_auc,2)))
```
#### ROC Curve
```{r, cache=TRUE}
#Plot the ROC Curve(s)
ggplot(abb_cat_roc,
aes(x = 1 - specificity, y = sensitivity,
group = model, col = model)) +
geom_path() +
geom_abline(lty = 3) +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "top")
```
Lasso-P {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
----------------------------------------------------------------------
### Lasso Model for Continuous Price Variable
Because of the large number of predictors in the dataset, lasso regression was used for narrowing down the number of predictors for the continuous variable (price). and the categorical variable (value_score_cat).
-----------------------------------------------------------------------
#### Price Predictors
After using the lasso regression, the number of predictors was reduced to 33. When all of these predictors were used in a linear regression model, many of them ended up being statistically insignificant so another 21 variables were eliminated through a backward elimination process. We can see the final regression coefficients below.
We can see from the VIP plot that num_bath is more than 5 times more important than the next largest factor (neighborhood_Cole). Number of bathrooms is a good proxy for the size of the property, and it makes sense that larger properties could command higher prices.
The metric comparison table shows that, both before and after performing backward elimination with the insignificant predictors, the lasso regression has a lower r-square (.406 then .405) than the original linear regression model (.448). The lasso regressions do have slightly lower mean absolute errors than does the base model.
We can see from the graph comparing the predictions of the original regression model and the final lasso model that both models tend to overpredict at lower prices, with the lasso regression exhibiting that trend more strongly.
Column {data-width=400, data-height=1170}
-----------------------------------------------------------------------
### Predicting Price
Here is the lasso regression model which narrowed down the number of predictors used to predict price.
```{r}
#continuous variable splitting into training and testing datasets
abb_reg_recipe <- recipe(price ~ ., data = abb_reg) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors()) %>%
prep()
abb_reg_norm <- bake(abb_reg_recipe, abb_reg)
```
```{r}
set.seed(12934)
abb_reg_split_norm <- initial_split(abb_reg_norm, prop = .7)
abb_reg_train_norm <- rsample::training(abb_reg_split_norm)
abb_reg_test_norm <- rsample::testing(abb_reg_split_norm)
```
```{r}
my_reg_metrics = metric_set(yardstick::rmse,
yardstick::mae,
yardstick::rsq)
#folds
abb_reg_grid <- tibble(penalty=seq(.1, 50, len = 500))
abb_reg_fold <- vfold_cv(abb_reg_train_norm, v=5)
```
```{r}
#Define Lasso Model Specifications
abb_reg_lassotune_spec <- linear_reg(penalty = tune(),
mixture = 1) %>%
set_engine("glmnet") %>%
set_mode("regression")
#Create the workflow, add the recipe and tune on penalty
abb_reg_lassotune_wf <- workflow() %>%
add_model(abb_reg_lassotune_spec) %>%
add_formula(price ~ .)
abb_reg_lassotune_rs <- abb_reg_lassotune_wf %>%
tune_grid(resamples = abb_reg_fold,
grid =abb_reg_grid,
metrics = my_reg_metrics)
lowest_rmse_lasso <- abb_reg_lassotune_rs %>%
select_best("rmse", penalty)
abb_reg_final_lasso <- abb_reg_lassotune_wf %>%
finalize_workflow(lowest_rmse_lasso)
abb_reg_final_lasso_fit <- abb_reg_final_lasso %>%
fit(abb_reg_train_norm)
pred_abb_reg_final_lasso_fit <- abb_reg_final_lasso_fit %>%
augment(abb_reg_train_norm)
```
```{r}
#abb_reg_final_lasso_fit %>%
# extract_fit_parsnip() %>%
# tidy() %>%
# filter(estimate != 0) %>%
#kable()
```
```{r}
# updating normalized datasets with pared-down variables
abb_reg_norm2 <- abb_reg_norm %>%
dplyr::select(price, host_response_rate, longitude, min_nights, max_nights, review_scores_rating, since_first_review, num_bath, host_in_denver_yes, neighborhood_Auraria, neighborhood_Belcaro, neighborhood_Berkeley, neighborhood_Capitol.Hill, neighborhood_CBD, neighborhood_Cheesman.Park, neighborhood_City.Park.West, neighborhood_Civic.Center, neighborhood_Cole, neighborhood_Gateway...Green.Valley.Ranch, neighborhood_Goldsmith, neighborhood_Highland, neighborhood_Indian.Creek, neighborhood_Lowry.Field, neighborhood_Mar.Lee, neighborhood_Overland, neighborhood_Stapleton, neighborhood_Union.Station, neighborhood_Washington.Park.West, neighborhood_Washington.Virginia.Vale, neighborhood_West.Highland, neighborhood_Whittier, room_type_Hotel.room, room_type_Shared.room, bath_type_shared)
```
```{r}
set.seed(12938)
abb_reg_split_norm2 <- initial_split(abb_reg_norm2, prop = .7)
abb_reg_train_norm2 <- rsample::training(abb_reg_split_norm2)
abb_reg_test_norm2 <- rsample::testing(abb_reg_split_norm2)
```
```{r}
# new folds for lasso dataset
abb_reg_fold2 <- vfold_cv(abb_reg_train_norm2, v=5)
```
```{r}
reg_spec <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
abb_reg_lasso_norm_fit <- reg_spec %>%
fit(price ~ ., data = abb_reg_train_norm2)
```
-----------------------------------------------------------------------
#### Results of linear regression with lasso-reduced predictors
```{r, cache=TRUE}
tidy(abb_reg_lasso_norm_fit$fit) %>%
kable(digits=3)
```
```{r}
curr_reg_metrics <- pred_abb_reg_final_lasso_fit %>%
my_reg_metrics(truth=price, estimate = .pred)
# adding results to comparison table
results_abb_reg_new <-tibble(model = 'all_predictor_lasso',
rmse = curr_reg_metrics[[1,3]],
mae = curr_reg_metrics[[2,3]],
rsq = curr_reg_metrics[[3,3]])
results_abb_reg <- bind_rows(results_abb_reg, results_abb_reg_new)
```
Column {data-width=400, data-height=170}
-----------------------------------------------------------------------
#### Metrics for Model Including All Lasso Predictors
```{r, cache=TRUE}
results_abb_reg %>%
kable(digits=3)
```
```{r}
#backwards elimination of insignificant predictors
#eliminate neighborhood_Indian.Creek and neighborhood_Whittier
#abb_reg_lasso_noicwh_fit <- reg_spec %>%
# fit(price ~ host_response_rate + longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_Cheesman.Park + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + room_type_Hotel.room + room_type_Shared.room + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_noicwh_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate host_response_rate
#abb_reg_lasso_host_fit <- reg_spec %>%
# fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_Cheesman.Park + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + room_type_Hotel.room + room_type_Shared.room + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_host_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate room_type_Hotel.room
#abb_reg_lasso_nohotel_fit <- reg_spec %>%
# fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_Cheesman.Park + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + room_type_Shared.room + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nohotel_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate room_type_Shared.room
#abb_reg_lasso_noshared_fit <- reg_spec %>%
# fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_Cheesman.Park + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_noshared_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Cheesman.Park
#abb_reg_lasso_nochees_fit <- reg_spec %>%
# fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Belcaro + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nochees_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Belcaro
#abb_reg_lasso_nobelcaro_fit <- reg_spec %>%
# fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Auraria + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nobelcaro_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Auraria
#abb_reg_lasso_noaur_fit <- reg_spec %>%
# fit(price ~ longitude + min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_noaur_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate longitude
#abb_reg_lasso_nolong_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Goldsmith + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nolong_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Goldsmith
#abb_reg_lasso_nogold_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Park.West + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nogold_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Washington.Park.West
#abb_reg_lasso_nowashw_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Berkeley + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nowashw_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Berkeley
#abb_reg_lasso_noberk_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Overland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_noberk_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Overland
#abb_reg_lasso_noover_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Lowry.Field + neighborhood_Mar.Lee + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_noover_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Lowry.Field
#abb_reg_lasso_nolow_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Mar.Lee + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nolow_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Mar.Lee
#abb_reg_lasso_nomar_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_Washington.Virginia.Vale + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nomar_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Washington.Virginia.Vale
#abb_reg_lasso_nowav_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_Capitol.Hill + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nowav_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_Capitol.Hill
#abb_reg_lasso_nocap_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + neighborhood_West.Highland + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nocap_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_West.Highland
#abb_reg_lasso_nowhi_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + host_in_denver_yes + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nowhi_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate host_in_denver_yes
#abb_reg_lasso_noden_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + since_first_review + num_bath + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_noden_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate since_first_review
#abb_reg_lasso_nofirrev_fit <- reg_spec %>%
# fit(price ~ min_nights + max_nights + review_scores_rating + num_bath + neighborhood_CBD + neighborhood_City.Park.West + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared, data = abb_reg_train_norm2)
#tidy(abb_reg_lasso_nofirrev_fit$fit) %>%
# kable(digits=3)
```
```{r}
#eliminate neighborhood_City.Park.West
abb_reg_lasso_nocityw_fit <- reg_spec %>%
fit(price ~ min_nights + max_nights + review_scores_rating + num_bath + neighborhood_CBD + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared, data = abb_reg_train_norm2)
```
Column {data-width=400, data-height=550}
-----------------------------------------------------------------------
#### Results of linear regression after backward elimination of insignificant predictors
The final equation for the lasso regression is: price ~ min_nights + max_nights + review_scores_rating + num_bath + neighborhood_CBD + neighborhood_Civic.Center + neighborhood_Cole + neighborhood_Gateway...Green.Valley.Ranch + neighborhood_Highland + neighborhood_Stapleton + neighborhood_Union.Station + bath_type_shared
```{r, cache=TRUE}
tidy(abb_reg_lasso_nocityw_fit$fit) %>%
kable(digits=3)
```
```{r}
abb_reg_lasso_final_fit <-abb_reg_lasso_nocityw_fit
pred_abb_reg_lasso_final_fit <- abb_reg_lasso_final_fit %>%
augment(abb_reg_train_norm2)
curr_reg_metrics <- pred_abb_reg_lasso_final_fit %>%
my_reg_metrics(truth=price, estimate = .pred)
# adding results to comparison table
results_abb_reg_new <-tibble(model = 'reg_lasso',
rmse = curr_reg_metrics[[1,3]],
mae = curr_reg_metrics[[2,3]],
rsq = curr_reg_metrics[[3,3]])
results_abb_reg <- bind_rows(results_abb_reg, results_abb_reg_new)
```
Column {data-width=400, data-height=550}
-----------------------------------------------------------------------
#### VIP Plot
```{r, cache=TRUE}
vip(abb_reg_lasso_final_fit)
```
Column {data-width=400, data-height=200}
-----------------------------------------------------------------------
#### Model Metrics Comparison
```{r}
results_abb_reg %>%
kable(digits=3)
```
Column {data-width=400, data-height=650}
--------------------------------------------------------------------------
#### Actual vs. Predicted model comparisons
```{r, cache=TRUE}
reg_pred <- bind_rows(pred_reg_all_pred_fit %>%
mutate(model = "Simple Regression"),
pred_abb_reg_lasso_final_fit %>%
mutate(model = "Lasso Regression"))
reg_pred %>%
ggplot(aes(x=price, y=.pred, col=model)) +
geom_point(alpha=.40) +
xlab("Actual Price") +
ylab("Predicted Price") +
xlim(c(0,2625)) +
geom_abline(col="gold") +
ggtitle("Comparing Simple Regression and Lasso Regression Models")
```
Lasso-VS {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
-----------------------------------------------------------------------
### Lasso Model for Categorical Value Score Variable
Because of the large number of predictors in the dataset, lasso regression was used for narrowing down the number of predictors for the categorical variable (value_score_cat). The two categories are Excellent (a perfect score) and Other.
We can see that the lasso regression narrowed the number of predictors down to three, and all three relate to customer reviews. Value score is a rating that indicates the extent to which the customer believes that the price they paid was worth the experience they had.
The most significant predictor was the number of reviews in the last 12 months. Its positive value indicates that a larger number of reviews made an Excellent score more likely. The second most significant predictor was the review scores rating. This is the overall rating that customers give. Interestingly, it is a negative number, meaning that higher overall ratings make an Excellent value score less likely. The third predictor is the number of months since the property's first review. This positive value shows that the longer a property has been an airbnb, the more likely it is to receive an Excellent value score.
Because the calculated R-square value is .464, we can see that these three predictors account for not-quite-half of the variability in the data. As the other predictors were eliminated in the lasso regression, we can infer that there important factors which are not captured in the data that is being measured. The metrics comparison shows that the regular logistic regression model with all of the predictors included is more accurate overall and has higher sensitivity, specificity, and auc. Finally, we can see clearly see the difference in the models' auc values on the ROC curve.
Column {data-width=400, data-height=110}
-----------------------------------------------------------------------
### Predicting Value Score
The lasso model yielded all significant predictors so no backward elimination process was needed.
-----------------------------------------------------------------------
```{r}
#categorical variable splitting into training and testing datasets
abb_cat_recipe <- recipe(value_score_cat ~ ., data = abb_cat) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors()) %>%
prep()
abb_cat_norm <- bake(abb_cat_recipe, abb_cat)
#names(abb_cat_norm)
set.seed(12934)
abb_cat_split_norm <- initial_split(abb_cat_norm, prop = .7, strata=value_score_cat)
abb_cat_train_norm <- rsample::training(abb_cat_split_norm)
abb_cat_test_norm <- rsample::testing(abb_cat_split_norm)
```
```{r}
#metrics
my_abb_cat_metrics <- metric_set(yardstick::accuracy,
yardstick::sensitivity,
yardstick::specificity)
```
```{r}
#folds
abb_cat_grid <- tibble(penalty=seq(.1, 50, len = 500))
abb_cat_fold <- vfold_cv(abb_cat_train_norm, v=5, strata=value_score_cat)
```
```{r}
#Define Model Specifications
abb_cat_lassotune_spec <- logistic_reg(penalty = tune(),
mixture = 1) %>%
set_engine("glmnet") %>%
set_mode("classification")
#Create the workflow, add the recipe and tune on penalty
abb_cat_lassotune_wf <- workflow() %>%
add_model(abb_cat_lassotune_spec) %>%
add_formula(value_score_cat ~ .)
abb_cat_lassotune_rs <- abb_cat_lassotune_wf %>%
tune_grid(resamples = abb_cat_fold,
grid =abb_cat_grid) #,
#metrics = my_abb_cat_metrics)
lowest_roc_auc_lasso <- abb_cat_lassotune_rs %>%
select_best("roc_auc", penalty)
abb_cat_final_lasso <- abb_cat_lassotune_wf %>%
finalize_workflow(lowest_roc_auc_lasso)
abb_cat_final_lasso_fit <- abb_cat_final_lasso %>%
fit(abb_cat_train_norm)
pred_abb_cat_final_lasso_fit <- abb_cat_final_lasso_fit %>%
augment(abb_cat_train_norm)
```
```{r}
#print(paste('The lowest ROC_AUC Lasso penalty is',lowest_roc_auc_lasso$penalty))
```
```{r}
#abb_cat_final_lasso_fit %>%
# extract_fit_parsnip() %>%
# tidy() %>%
#kable()
```
```{r}
#abb_cat_final_lasso_fit %>%
# extract_fit_parsnip() %>%
# tidy() %>%
# filter(estimate != 0) %>%
#kable()
```
```{r}
#my_abb_cat_metrics <- metric_set(yardstick::accuracy,yardstick::sensitivity,
# yardstick::specificity)
#abb_cat_curr_metrics <- pred_abb_cat_final_lasso_fit %>%
# my_abb_cat_metrics(truth = value_score_cat, estimate = .pred_class)
#curr_auc <- pred_abb_cat_final_lasso_fit %>%
# roc_auc(truth = value_score_cat, estimate = .pred_excellent) %>%
# pull(.estimate)
#results_abb_cat_new <- tibble(model = 'first_lasso',
# accuracy = abb_cat_curr_metrics[[1,3]],
# sensitivity = abb_cat_curr_metrics[[2,3]],
# specificity = abb_cat_curr_metrics[[3,3]],
# auc = round(curr_auc,2))
#results_abb_cat <-bind_rows(results_abb_cat, results_abb_cat_new)
```
```{r}
# create new normalized data and splits with smaller variable list
abb_cat_norm2 <- abb_cat_norm %>%
dplyr::select(value_score_cat, number_of_reviews_ltm, review_scores_rating, since_first_review)
set.seed(12938)
abb_cat_split_norm2 <- initial_split(abb_cat_norm2, prop = .7, strata=value_score_cat)
abb_cat_train_norm2 <- rsample::training(abb_cat_split_norm2)
abb_cat_test_norm2 <- rsample::testing(abb_cat_split_norm2)
```
```{r}
# new folds for lasso dataset
abb_cat_fold2 <- vfold_cv(abb_cat_train_norm2, v=5, strata=value_score_cat)
log_spec <- logistic_reg() %>%
set_engine('glm') %>%
set_mode('classification')
log_lasso_full_fit <- log_spec %>%
fit(value_score_cat ~ ., data = abb_cat_train_norm2)
#summary(log_lasso_full_fit$fit)
```
```{r}
#Column {data-width=400, data-height=100}
#-----------------------------------------------------------------------
#### A Designation?
#```{r, cache=TRUE}
#glance(log_lasso_full_fit$fit) %>%
# kable(digits=3)
```
Column {data-width=400, data-height=300}
-----------------------------------------------------------------------
#### Results of logistic regression using lasso predictors
The final equation for the lasso regression is: value_score_cat ~ number_of_reviews_ltm + review_scores_rating + since_first_review
```{r, cache=TRUE}
tidy(log_lasso_full_fit$fit) %>%
kable(digits=3)
```
Column {data-width=400, data-height=150}
-----------------------------------------------------------------------
#### Calculated R-square
```{r, cache=TRUE}
#find R^2
dev <- glance(log_lasso_full_fit$fit) %>%
pull(deviance)
null_dev <-glance(log_lasso_full_fit$fit) %>%
pull(null.deviance)
lasso_rsq <- tibble(Measure="R-square",
Value = 1 - (dev/null_dev))
lasso_rsq %>%
kable(digits=3)
```
Column {data-width=400, data-height=170}
-----------------------------------------------------------------------
#### Confusion Matrix
```{r, cache=TRUE}
#confusion matrix
pred_log_lasso_full <- log_lasso_full_fit %>%
augment(abb_cat_test_norm)
pred_log_lasso_full %>%
conf_mat(truth = value_score_cat, estimate = .pred_class)
```
```{r}
#metrics
my_abb_cat_metrics <- metric_set(yardstick::accuracy,yardstick::sensitivity,
yardstick::specificity)
abb_cat_curr_metrics <- pred_log_lasso_full %>%
my_abb_cat_metrics(truth = value_score_cat, estimate = .pred_class)
curr_auc <- pred_log_lasso_full %>%
roc_auc(truth = value_score_cat, estimate = .pred_excellent) %>%
pull(.estimate)
results_abb_cat_new <- tibble(model = 'log_lasso_full',
accuracy = abb_cat_curr_metrics[[1,3]],
sensitivity = abb_cat_curr_metrics[[2,3]],
specificity = abb_cat_curr_metrics[[3,3]],
auc = round(curr_auc,2))
results_abb_cat <-bind_rows(results_abb_cat, results_abb_cat_new)
```
Column {data-width=400, data-height=220}
-----------------------------------------------------------------------
#### Model Metrics Comparison
```{r}
results_abb_cat %>%
kable(digits = 3)
```
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
#### VIP Plot
```{r, cache=TRUE}
vip(log_lasso_full_fit)
```
```{r}
# ROC
#Capture the thresholds and sens/spec
abb_cat_roc <- bind_rows(abb_cat_roc,
pred_log_lasso_full %>%
roc_curve(truth = value_score_cat, estimate=.pred_excellent) %>%
mutate(model = paste('log_lasso_full',round(curr_auc,2))))
```
Column {data-width=400, data-height=590}
-----------------------------------------------------------------------
#### ROC Curve
```{r, cache=TRUE}
#Plot the ROC Curve(s)
ggplot(abb_cat_roc,
aes(x = 1 - specificity, y = sensitivity,
group = model, col = model)) +
geom_path() +
geom_abline(lty = 3) +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "top")
```
Rand.For.-P {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
----------------------------------------------------------------------
### Random Forest Model for Price Variable
Two random forest models were run: one with all of the available predictors and one using only the predictors resulting from the lasso regression. For both of the models I tried the following parameters:
* **Number of variables to include (mtry)**: 2,3,4,5
* **Number of trees (trees)**: 500, 1000
* **Minimum number of records per leaf (min_n)**: 5, 10, 15, 20
* **Maximum depth of tree (max.depth)**: 5, 6, 7, 8
We can see that the best model for all predictors had these arguments:
mtry = 5
trees = 500
min_n = 20
max.depth = 8
It did slightly better than the regular or lasso regressions in terms of R-square, but it had markedly higher error values.
The best model for the lasso predictor-only model had these arguments:
mtry = 5
trees = 1000
min_n = 10
max.depth = 8
It had the best R-square of all of the models at .484, and its rmse and mae values were much lower than those of the initial random forest model, though still higher than the two regression models.
We see from the two graphs of actual and predicted values that the random forest models - particularly the lasso one - do better at predicting the lower price properties.
The VIP plots also show the number of bathrooms as the most important predictor by far. This has - unsurprisingly - been the case for all of the models.
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
### Part 1: Using all predictors
```{r}
#Random Forest for Predicting Price
#Part 1: using all predictors
#use original folds
rf_grid <- expand_grid(mtry = 2:5,
trees = c(500, 1000),
min_n = c(5,10,15,20),
max.depth = c(5,6,7,8))
rf_reg_tune_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger",
importance = "impurity",
max.depth = tune()) %>%
set_mode("regression")
reg_tree_wf <- workflow() %>%
add_model(rf_reg_tune_spec) %>%
add_formula(price ~ .)
reg_tree_full_rs <- reg_tree_wf %>%
tune_grid(resamples = abb_reg_fold,
grid = rf_grid)
#finalize workflow
final_reg_tree_full_wf <- reg_tree_wf %>%
finalize_workflow(select_best(reg_tree_full_rs))
final_reg_tree_full_wf
```
```{r}
#fit the model
set.seed(1996)
final_reg_tree_full_fit<-final_reg_tree_full_wf %>%
fit(data=abb_reg_train_norm)
#calculate metrics
pred_final_reg_tree_full <- final_reg_tree_full_fit%>%
augment(abb_reg_test_norm)
```
```{r}
# adding results to comparison table
curr_reg_metrics <- pred_final_reg_tree_full%>%
my_reg_metrics(truth=price, estimate=.pred)
results_abb_reg_new <-tibble(model = 'reg_full_rf',
rmse = curr_reg_metrics[[1,3]],
mae = curr_reg_metrics[[2,3]],
rsq = curr_reg_metrics[[3,3]])
results_abb_reg <- bind_rows(results_abb_reg, results_abb_reg_new)
```
Column {data-width=400, data-height=220}
-----------------------------------------------------------------------
#### Model Metrics Comparison
```{r, cache=TRUE}
results_abb_reg %>%
kable(digits=3)
```
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
#### VIP Plot
```{r, cache=TRUE}
#VIP plot
final_reg_tree_full_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "#6e0000", col = "black"))
```
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
### Part 2: Using Lasso predictors only
```{r}
#Part 2: using lasso predictors only
# re-use grid, model specification, workflow from Part 1
reg_tree_lasso_rs <- reg_tree_wf%>%
tune_grid(resamples = abb_reg_fold2,
grid = rf_grid)
#finalize workflow
final_reg_tree_lasso_wf <- reg_tree_wf %>%
finalize_workflow(select_best(reg_tree_lasso_rs))
final_reg_tree_lasso_wf
```
```{r}
#fit the model
set.seed(1998)
final_reg_tree_lasso_fit<-final_reg_tree_lasso_wf %>%
fit(data=abb_reg_train_norm2)
#calculate metrics
pred_final_reg_tree_lasso <- final_reg_tree_lasso_fit%>%
augment(abb_reg_test_norm2)
```
```{r}
# adding results to comparison table
curr_reg_metrics <- pred_final_reg_tree_lasso%>%
my_reg_metrics(truth=price, estimate=.pred)
results_abb_reg_new <-tibble(model = 'reg_lasso_rf',
rmse = curr_reg_metrics[[1,3]],
mae = curr_reg_metrics[[2,3]],
rsq = curr_reg_metrics[[3,3]])
results_abb_reg <- bind_rows(results_abb_reg, results_abb_reg_new)
```
Column {data-width=400, data-height=220}
-----------------------------------------------------------------------
#### Model Metrics Comparison
```{r, cache=TRUE}
results_abb_reg %>%
kable(digits=3)
```
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
#### VIP Plot
```{r, cache=TRUE}
#VIP plot
final_reg_tree_lasso_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "#6e0000", col = "black"))
```
Column {data-width=400, data-height=1300}
--------------------------------------------------------------------------
#### Actual vs. Predicted model comparisons
```{r, cache=TRUE}
reg_pred <- bind_rows(pred_final_reg_tree_full %>%
mutate(model="Full Random Forest"),
pred_final_reg_tree_lasso %>%
mutate(model="Lasso Random Forest"))
reg_pred %>%
ggplot(aes(x=price, y=.pred, col=model)) +
geom_point(alpha=.40) +
xlab("Actual Price") +
ylab("Predicted Price") +
xlim(c(0,2625)) +
geom_abline(col="gold") +
ggtitle("Comparing Full and Lasso Random Forest Models")
```
```{r, cache=TRUE}
reg_pred <- bind_rows(pred_reg_all_pred_fit %>%
mutate(model = "Simple Regression"),
pred_abb_reg_lasso_final_fit %>%
mutate(model = "Lasso Regression"),
pred_final_reg_tree_full %>%
mutate(model="Full Random Forest"),
pred_final_reg_tree_lasso %>%
mutate(model="Lasso Random Forest"))
reg_pred %>%
ggplot(aes(x=price, y=.pred, col=model)) +
geom_point(alpha=.40) +
xlab("Actual Price") +
ylab("Predicted Price") +
xlim(c(0,2625)) +
geom_abline(col="gold") +
ggtitle("Comparing Regression and Random Forest Models")
```
Rand.For.-VS {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
-------------------------------------
### Random Forest Model for Value Score Variable
Two random forest models were run: one with all of the available predictors and one using only the predictors resulting from the lasso regression. For the model using all predictors I tried the following parameters:
* **Number of variables to include (mtry)**: 2,3,4,5
* **Number of trees (trees)**: 500, 1000
* **Minimum number of records per leaf (min_n)**: 5, 10, 15, 20
* **Maximum depth of tree (max.depth)**: 5, 6, 7, 8
Because there were only three variables remaining after the lasso regression the second tree model had this change:
* **Number of variables to include**: 2,3
We can see that the best model for all predictors had these arguments:
mtry = 4
trees = 500
min_n = 10
max.depth = 8
It did better than the regular or lasso logistic regressions in terms of specificity. It did better than the lasso regression for accuracy and auc, but it did the worst - by far - on sensitivity.
The best model for the lasso predictor-only model had these arguments:
mtry = 2
trees = 1000
min_n = 20
max.depth = 5
With only 3 predictors, it is not surprising that the max depth is less and the min_n is more.
With an auc of .96, the random forest model with lasso predictors tied the auc of the full logistic regression model with all predictors. The lasso random forest did almost as well as the full logistic regression model. The ROC curve shows these comparisons nicely.
All the important predictors for both models are all related to review scores, and the most important predictor according to both models is review_scores_rating, which is the overall rating that a customer gives. For the model with all predictors, the number of reviews is a close second.
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
### Part 1: Using all predictors
```{r}
#Random Forest for Predicting Value Score Category
#Part 1: using all predictors
#use original folds and random forest grid previously defined
rf_cat_tune_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n=tune()) %>%
set_engine("ranger",
importance = "impurity",
max.depth = tune()) %>%
set_mode("classification")
cat_tree_wf <- workflow() %>%
add_model(rf_cat_tune_spec) %>%
add_formula(value_score_cat ~ .)
cat_tree_full_rs <- cat_tree_wf %>%
tune_grid(resamples = abb_cat_fold,
grid = rf_grid)
#finalize workflow
final_cat_tree_full_wf <- cat_tree_wf %>%
finalize_workflow(select_best(cat_tree_full_rs))
final_cat_tree_full_wf
```
```{r}
#fit the model
set.seed(1996)
final_cat_tree_full_fit<-final_cat_tree_full_wf %>%
fit(data=abb_cat_train_norm)
#calculate metrics
pred_final_cat_tree_full <- final_cat_tree_full_fit%>%
augment(abb_cat_test_norm)
```
Column {data-width=400, data-height=170}
-----------------------------------------------------------------------
#### Confusion Matrix
```{r, cache=TRUE}
#confusion matrix
pred_final_cat_tree_full %>%
conf_mat(truth = value_score_cat, estimate = .pred_class)
```
```{r}
# adding results to comparison table
abb_cat_curr_metrics <- pred_final_cat_tree_full%>%
my_abb_cat_metrics(truth=value_score_cat, estimate=.pred_class)
curr_auc <- pred_final_cat_tree_full %>%
roc_auc(truth = value_score_cat, estimate = .pred_excellent) %>%
pull(.estimate)
results_abb_cat_new <- tibble(model = 'cat_full_rf',
accuracy = abb_cat_curr_metrics[[1,3]],
sensitivity = abb_cat_curr_metrics[[2,3]],
specificity = abb_cat_curr_metrics[[3,3]],
auc = round(curr_auc,2))
results_abb_cat <-bind_rows(results_abb_cat, results_abb_cat_new)
```
Column {data-width=400, data-height=220}
-----------------------------------------------------------------------
#### Model Metrics Comparison
```{r, cache=TRUE}
results_abb_cat %>%
kable(digits = 3)
```
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
#### VIP Plot
```{r, cache=TRUE}
#VIP plot
final_cat_tree_full_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "#6e0000", col = "black"))
```
```{r}
# ROC
#Capture the thresholds and sens/spec
abb_cat_roc <- bind_rows(abb_cat_roc,
pred_final_cat_tree_full %>%
roc_curve(truth = value_score_cat, estimate=.pred_excellent) %>%
mutate(model = paste('rf_tree_full',round(curr_auc,2))))
```
Column {data-width=400, data-height=590}
-----------------------------------------------------------------------
#### ROC Curve
```{r, cache=TRUE}
#Plot the ROC Curve(s)
ggplot(abb_cat_roc,
aes(x = 1 - specificity, y = sensitivity,
group = model, col = model)) +
geom_path() +
geom_abline(lty = 3) +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "top")
```
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
### Part 2: Using Lasso predictors only
```{r}
#Part 2: using lasso predictors only
# re-use model specification, workflow from Part 1
#specify new grid b/c only 3 predictors captured by lasso
rf_grid <- expand_grid(mtry = c(2,3),
trees = c(500, 1000),
min_n = c(5,10,15,20),
max.depth = c(5,6,7,8))
cat_tree_lasso_rs <- cat_tree_wf%>%
tune_grid(resamples = abb_cat_fold2,
grid = rf_grid)
#finalize workflow
final_cat_tree_lasso_wf <- cat_tree_wf %>%
finalize_workflow(select_best(cat_tree_lasso_rs))
final_cat_tree_lasso_wf
```
```{r}
#fit the model
set.seed(1998)
final_cat_tree_lasso_fit<-final_cat_tree_lasso_wf %>%
fit(data=abb_cat_train_norm2)
#calculate metrics
pred_final_cat_tree_lasso <- final_cat_tree_lasso_fit%>%
augment(abb_cat_test_norm2)
```
Column {data-width=400, data-height=170}
-----------------------------------------------------------------------
#### Confusion Matrix
```{r, cache=TRUE}
#confusion matrix
pred_final_cat_tree_lasso %>%
conf_mat(truth = value_score_cat, estimate = .pred_class)
```
```{r}
# adding results to comparison table
abb_cat_curr_metrics <- pred_final_cat_tree_lasso%>%
my_abb_cat_metrics(truth=value_score_cat, estimate=.pred_class)
curr_auc <- pred_final_cat_tree_lasso %>%
roc_auc(truth = value_score_cat, estimate = .pred_excellent) %>%
pull(.estimate)
results_abb_cat_new <- tibble(model = 'cat_lasso_rf',
accuracy = abb_cat_curr_metrics[[1,3]],
sensitivity = abb_cat_curr_metrics[[2,3]],
specificity = abb_cat_curr_metrics[[3,3]],
auc = round(curr_auc,2))
results_abb_cat <-bind_rows(results_abb_cat, results_abb_cat_new)
```
Column {data-width=400, data-height=220}
-----------------------------------------------------------------------
#### Model Metrics Comparison
```{r, cache=TRUE}
results_abb_cat %>%
kable(digits = 3)
```
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
#### VIP Plot
```{r, cache=TRUE}
#VIP plot
final_cat_tree_lasso_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "#6e0000", col = "black"))
```
```{r}
# ROC
#Capture the thresholds and sens/spec
abb_cat_roc <- bind_rows(abb_cat_roc,
pred_final_cat_tree_lasso %>%
roc_curve(truth = value_score_cat, estimate=.pred_excellent) %>%
mutate(model = paste('rf_tree_lasso',round(curr_auc,2))))
```
Column {data-width=400, data-height=590}
-----------------------------------------------------------------------
#### ROC Curve
```{r, cache=TRUE}
#Plot the ROC Curve(s)
ggplot(abb_cat_roc,
aes(x = 1 - specificity, y = sensitivity,
group = model, col = model)) +
geom_path() +
geom_abline(lty = 3) +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "top")
```
Conclusion {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=520}
-----------------------------------------------------------------------
### Project Conclusion
**Price**
Number of baths is the most important predictor of price, by a huge margin. Number of baths is a good proxy for property size: more baths suggests larger property. Price being higher for a larger property makes sense.
If we want to find out what can be done to charge a higher price for a property, though, we need to look to the VIP plot (for relative importance) and the regression coefficients (to see if they are positive or negative, in this case). The chosen VIP plot is from the best-performing model overall; the coefficients are from the best regression model. Though the two models don't agree on the ranking, we can still gain insight.
The three most important factors that a host could change are minimum nights, review scores rating, and maximum nights. We see from the regression coefficients that having a lower number of minimum nights, a higher number of maximum nights, and a higher review scores rating all increase price. The takeaway for a host is to be flexible (allow for both shorter and longer stays) and to work hard to please the customers.
Regarding the models themselves, we see from the metrics comparison and the actual vs predicted plot that the random forest model using the lasso predictors does the best at explaining the data, especially at the lower end of the price scale. The wide range of prices makes it difficult to predict the ones at the higher end of the scale. This is likely why none of the models were able to explain even half of the variation in the data (the highest R-square was .484).
--------------------------------------------------------------------
**Value Score**
Review scores rating is the most important predictor of value score, by far, according to the random forest model that uses only lasso predictors. We can see that this model has the highest auc value and the second-best accuracy, sensitivity, and specificity values. I chose this model because it is simpler than the logistic regression model with all predictors (and because only 22 of the 105 normalized predictors from that model had p-values < .1).
The coefficients from the logistic regression with lasso predictors give us the additional information that review scores rating is **negatively** correlated with value score. This would mean that higher-rated properties overall are rated as having lower value for the price. The other two significant predictors from the lasso regression are the number of reviews in the last 12 months and the number of months since the first review of the property. These are both positively correlated to value score, which means that properties that have been airbnbs longer and which have had more customers recently receive better ratings for value.
The takeaway from the negative correlation between review scores rating and value score may be that, even though customers have a great overall experience at an airbnb, they may believe that it was overpriced.
More research should be done before offering airbnb hosts advice on this matter, however.
Column {data-width=400, data-height=220}
-----------------------------------------------------------------------
#### Price Model Metrics Comparison
```{r, cache=TRUE}
results_abb_reg %>%
kable(digits=3)
```
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
#### VIP Plot for Random Forest Model with Lasso Predictors
```{r, cache=TRUE}
#VIP plot
final_reg_tree_lasso_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "#6e0000", col = "black"))
```
Column {data-width=400, data-height=650}
--------------------------------------------------------------------------
#### Actual vs. Predicted Price model comparisons
```{r, cache=TRUE}
reg_pred <- bind_rows(pred_reg_all_pred_fit %>%
mutate(model = "Simple Regression"),
pred_abb_reg_lasso_final_fit %>%
mutate(model = "Lasso Regression"),
pred_final_reg_tree_full %>%
mutate(model="Full Random Forest"),
pred_final_reg_tree_lasso %>%
mutate(model="Lasso Random Forest"))
reg_pred %>%
ggplot(aes(x=price, y=.pred, col=model)) +
geom_point(alpha=.40) +
xlab("Actual Price") +
ylab("Predicted Price") +
xlim(c(0,2625)) +
geom_abline(col="gold") +
ggtitle("Comparing Regression and Random Forest Models")
```
Column {data-width=400, data-height=550}
-----------------------------------------------------------------------
#### Linear regression with lasso predictors
```{r, cache=TRUE}
tidy(abb_reg_lasso_nocityw_fit$fit) %>%
kable(digits=3)
```
Column {data-width=400, data-height=220}
-----------------------------------------------------------------------
#### Value Score Model Metrics Comparison
```{r, cache=TRUE}
results_abb_cat %>%
kable(digits = 3)
```
Column {data-width=400, data-height=520}
-----------------------------------------------------------------------
#### VIP Plot for Random Forest Model with Lasso Predictors
```{r, cache=TRUE}
#VIP plot
final_cat_tree_lasso_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "#6e0000", col = "black"))
```
Column {data-width=400, data-height=590}
-----------------------------------------------------------------------
#### Value Score ROC Curves
```{r, cache=TRUE}
#Plot the ROC Curve(s)
ggplot(abb_cat_roc,
aes(x = 1 - specificity, y = sensitivity,
group = model, col = model)) +
geom_path() +
geom_abline(lty = 3) +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "top")
```
Column {data-width=400, data-height=300}
-----------------------------------------------------------------------
#### Logistic regression with lasso predictors
```{r, cache=TRUE}
tidy(log_lasso_full_fit$fit) %>%
kable(digits=3)
```
Reflection {data-orientation=rows}
=======================================================================
Row {data-height=1250}
-----------------------------------------------------------------------
### Reflection
**Most Proud**
I am most proud of my data cleaning and variable transformations. I worked hard to make sure that all of the necessary transformations were done in R so that I could import the data straight in from the InsideAirbnb.com website at any time. Some of the code is clunky (I used a massive case statement to recode the bathroom variable), but it works!
**If I Had More Time**
If I had another week I would try a log transformation on the price variable. It's just so skewed. I don't think it would change the fact that number of bathrooms (property size, really) is the dominant factor for price, but it might allow for more insight about what else matters.
For the categorical variable, I would want to try some regular trees using all the predictors. It's interesting that the overall customer rating is negatively correlated with the value score rating. I'm curious if other score ratings would be positively or negatively correlated to it.